### Task 1: Basic Data Profiling of a CSV File
**Description**: Load a CSV file and generate a Pandas-Profiling report (the library is now published as `ydata-profiling`, which is what the code below imports).

**Steps**:
1. Load a CSV File: Make sure you have a CSV file (e.g., data.csv). Load it using pandas.
2. Generate a Profile Report.
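
If no CSV file is at hand, the example rows that the loading cell prints in its "file not found" branch can be written to disk first. The following is a minimal sketch using only the standard library; `SAMPLE_CSV` and `write_sample_csv` are hypothetical names introduced for illustration, not part of the original notebook.

```python
from pathlib import Path

# Example rows taken from the "file not found" message in the loading cell.
SAMPLE_CSV = """ID,Name,Age,City,Salary,HasChildren,LastLogin
1,Alice,30,New York,60000,True,2024-01-15
2,Bob,24,London,45000,False,2024-02-20
3,Charlie,35,Paris,75000,True,2024-03-10
4,David,,Berlin,50000,False,2024-04-01
5,Eve,29,New York,80000,True,2024-05-20
6,Frank,40,,65000,False,2024-01-05
7,Grace,22,London,,True,2024-02-12
"""


def write_sample_csv(path="data.csv", overwrite=False):
    """Write the sample rows to `path` and return it as a Path (hypothetical helper)."""
    target = Path(path)
    # Leave an existing file alone unless the caller explicitly asks to replace it.
    if target.exists() and not overwrite:
        return target
    target.write_text(SAMPLE_CSV, encoding="utf-8")
    return target
```

Calling `write_sample_csv("data.csv")` before the loading cell should give it a file with one missing Age, one missing City, and one missing Salary to profile.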

In [16]:
import pandas as pd
from ydata_profiling import ProfileReport  # pandas-profiling was renamed to ydata-profiling
import os

# --- Configuration ---
csv_file_path = 'data.csv'
output_report_html = 'data_profile_report.html'

# --- Step 1: Load a CSV File ---
try:
    df = pd.read_csv(csv_file_path)
    print(f"Successfully loaded '{csv_file_path}' into a DataFrame.")
    print("DataFrame Head:")
    print(df.head())
    print("\nDataFrame Info:")
    df.info()
    print("-" * 50)

except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print(f"Please ensure '{csv_file_path}' is in the same directory as this script, or provide its full path.")
    print("\nExample 'data.csv' content to create:")
    print("```csv")
    print("ID,Name,Age,City,Salary,HasChildren,LastLogin")
    print("1,Alice,30,New York,60000,True,2024-01-15")
    print("2,Bob,24,London,45000,False,2024-02-20")
    print("3,Charlie,35,Paris,75000,True,2024-03-10")
    print("4,David,,Berlin,50000,False,2024-04-01")
    print("5,Eve,29,New York,80000,True,2024-05-20")
    print("6,Frank,40,,65000,False,2024-01-05")
    print("7,Grace,22,London,,True,2024-02-12")
    print("```")
    raise SystemExit(1)  # Stop here if the file was not found

except pd.errors.EmptyDataError:
    print(f"Error: The file '{csv_file_path}' is empty.")
    raise SystemExit(1)

except Exception as e:
    print(f"An unexpected error occurred while loading the CSV: {e}")
    raise SystemExit(1)

# --- Step 2: Generate a Profile Report ---
print(f"Generating YData-Profiling report for '{csv_file_path}'...")

# Create the profile report object
profile = ProfileReport(df, title="Data Profiling Report", explorative=True)

# Save the report to an HTML file
profile.to_file(output_report_html)

print(f"Report generated successfully! Check '{output_report_html}' in your current directory.")
print(f"Current working directory: {os.getcwd()}")

Successfully loaded 'data.csv' into a DataFrame.
DataFrame Head:
   ID     Name   Age      City   Salary  HasChildren   LastLogin
0   1    Alice  30.0  New York  60000.0         True  2024-01-15
1   2      Bob  24.0    London  45000.0        False  2024-02-20
2   3  Charlie  35.0     Paris  75000.0         True  2024-03-10
3   4    David   NaN    Berlin  50000.0        False  2024-04-01
4   5      Eve  29.0  New York  80000.0         True  2024-05-20

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           7 non-null      int64  
 1   Name         7 non-null      object 
 2   Age          6 non-null      float64
 3   City         6 non-null      object 
 4   Salary       6 non-null      float64
 5   HasChildren  7 non-null      bool   
 6   LastLogin    7 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(3)

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'putmask: first argument must be an array')
Summarize dataset: 100%|██████████| 26/26 [00:01<00:00, 21.32it/s, Completed]                 
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.37s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  4.56it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 607.69it/s]

Report generated successfully! Check 'data_profile_report.html' in your current directory.
Current working directory: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Automating Data Quality Measurement
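
The warning fragment in the output above suggests that a correlation step failed (`'putmask: first argument must be an array'`) and that it can be silenced by switching off the "auto" correlation. A minimal sketch of that suggestion follows. The `correlations` setting is copied from the warning text itself; `REPORT_OPTIONS` and `build_report` are hypothetical names used only for illustration.

```python
# Keyword arguments for ProfileReport. The "correlations" entry mirrors the
# setting quoted in the warning and skips the "auto" correlation calculation.
REPORT_OPTIONS = {
    "title": "Data Profiling Report",
    "explorative": True,
    "correlations": {"auto": {"calculate": False}},
}


def build_report(df, options=None):
    """Return a ProfileReport for `df` using REPORT_OPTIONS (hypothetical helper)."""
    # Imported lazily so the options can be inspected without the library installed.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **(options or REPORT_OPTIONS))
```

With the DataFrame from the cell above, `build_report(df).to_file("data_profile_report.html")` would regenerate the report without attempting that correlation; whether skipping it is acceptable depends on how much the correlation section matters for the analysis.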





### Task 2: Understanding Missing Values with Pandas-Profiling

**Description**: Identify missing values in your dataset using pandas-profiling.

**Steps**: 
1. Generate a Profile Report to Analyze Missing Values


In [17]:
# Write your code from here
import pandas as pd
from ydata_profiling import ProfileReport # Make sure this is ydata_profiling
import os

# --- Configuration ---
csv_file_path = 'data.csv' # Ensure this file exists
output_report_html = 'data_profile_report.html'

# --- Step 1: Load a CSV File ---
try:
    df = pd.read_csv(csv_file_path)
    print(f"Successfully loaded '{csv_file_path}' into a DataFrame.")
    print("DataFrame Head:")
    print(df.head())
    print("\nDataFrame Info:")
    df.info()
    print("-" * 50)

except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print(f"Please ensure '{csv_file_path}' is in the same directory as this script, or provide its full path.")
    raise SystemExit(1)

except pd.errors.EmptyDataError:
    print(f"Error: The file '{csv_file_path}' is empty.")
    raise SystemExit(1)

except Exception as e:
    print(f"An unexpected error occurred while loading the CSV: {e}")
    raise SystemExit(1)

# --- Step 2: Generate a Profile Report to Analyze Missing Values ---
print(f"Generating YData-Profiling report for '{csv_file_path}'...")

profile = ProfileReport(df, title="Data Profiling Report", explorative=True)

# Save the report to an HTML file
profile.to_file(output_report_html)

print(f"Report generated successfully! Open '{output_report_html}' in your web browser to analyze missing values.")
print(f"Current working directory: {os.getcwd()}")

Successfully loaded 'data.csv' into a DataFrame.
DataFrame Head:
   ID     Name   Age      City   Salary  HasChildren   LastLogin
0   1    Alice  30.0  New York  60000.0         True  2024-01-15
1   2      Bob  24.0    London  45000.0        False  2024-02-20
2   3  Charlie  35.0     Paris  75000.0         True  2024-03-10
3   4    David   NaN    Berlin  50000.0        False  2024-04-01
4   5      Eve  29.0  New York  80000.0         True  2024-05-20

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           7 non-null      int64  
 1   Name         7 non-null      object 
 2   Age          6 non-null      float64
 3   City         6 non-null      object 
 4   Salary       6 non-null      float64
 5   HasChildren  7 non-null      bool   
 6   LastLogin    7 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(3)

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'putmask: first argument must be an array')
Summarize dataset: 100%|██████████| 26/26 [00:01<00:00, 18.68it/s, Completed]                
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.42s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  4.75it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 576.54it/s]

Report generated successfully! Open 'data_profile_report.html' in your web browser to analyze missing values.
Current working directory: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Automating Data Quality Measurement
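
The HTML report visualizes missing values, but the same counts can be cross-checked directly in pandas. The sketch below assumes the example rows printed in Task 1's "file not found" branch and reads them from an in-memory string so that it runs without `data.csv`; the variable names are illustrative.

```python
import io

import pandas as pd

# Example rows from Task 1, embedded so the sketch does not depend on data.csv.
SAMPLE_CSV = """ID,Name,Age,City,Salary,HasChildren,LastLogin
1,Alice,30,New York,60000,True,2024-01-15
2,Bob,24,London,45000,False,2024-02-20
3,Charlie,35,Paris,75000,True,2024-03-10
4,David,,Berlin,50000,False,2024-04-01
5,Eve,29,New York,80000,True,2024-05-20
6,Frank,40,,65000,False,2024-01-05
7,Grace,22,London,,True,2024-02-12
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count and percentage of missing cells per column.
missing_count = df.isna().sum()
missing_pct = (df.isna().mean() * 100).round(1)
missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})

# Show only the columns that actually contain gaps.
print(missing_summary[missing_summary["missing"] > 0])

# Rows that have at least one missing value.
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps[["ID", "Name"]])
```

For the sample rows this flags Age, City, and Salary with one missing value each, which matches the non-null counts reported by `df.info()` above.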





### Task 3: Analyze Data Types Using Pandas-Profiling
**Description**: Use Pandas-Profiling to analyze and check data types of your dataset.

In [18]:
# Write your code from here
import pandas as pd
from ydata_profiling import ProfileReport # Make sure this is ydata_profiling
import os

# --- Configuration ---
csv_file_path = 'data.csv' # Ensure this file exists
output_report_html = 'data_profile_report.html'

# --- Step 1: Load a CSV File ---
try:
    df = pd.read_csv(csv_file_path)
    print(f"Successfully loaded '{csv_file_path}' into a DataFrame.")
    print("DataFrame Head:")
    print(df.head())
    print("\nDataFrame Info:")
    df.info() # This also gives a quick textual overview of dtypes
    print("-" * 50)

except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print(f"Please ensure '{csv_file_path}' is in the same directory as this script, or provide its full path.")
    raise SystemExit(1)

except pd.errors.EmptyDataError:
    print(f"Error: The file '{csv_file_path}' is empty.")
    raise SystemExit(1)

except Exception as e:
    print(f"An unexpected error occurred while loading the CSV: {e}")
    raise SystemExit(1)

# --- Step 2: Generate a Profile Report to Analyze Data Types ---
print(f"Generating YData-Profiling report for '{csv_file_path}'...")

profile = ProfileReport(df, title="Data Profiling Report", explorative=True)

# Save the report to an HTML file
profile.to_file(output_report_html)

print(f"Report generated successfully! Open '{output_report_html}' in your web browser to analyze data types.")
print(f"Current working directory: {os.getcwd()}")

Successfully loaded 'data.csv' into a DataFrame.
DataFrame Head:
   ID     Name   Age      City   Salary  HasChildren   LastLogin
0   1    Alice  30.0  New York  60000.0         True  2024-01-15
1   2      Bob  24.0    London  45000.0        False  2024-02-20
2   3  Charlie  35.0     Paris  75000.0         True  2024-03-10
3   4    David   NaN    Berlin  50000.0        False  2024-04-01
4   5      Eve  29.0  New York  80000.0         True  2024-05-20

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           7 non-null      int64  
 1   Name         7 non-null      object 
 2   Age          6 non-null      float64
 3   City         6 non-null      object 
 4   Salary       6 non-null      float64
 5   HasChildren  7 non-null      bool   
 6   LastLogin    7 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(3)

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'putmask: first argument must be an array')
Summarize dataset: 100%|██████████| 26/26 [00:01<00:00, 23.26it/s, Completed]                 
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  5.26it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 633.77it/s]

Report generated successfully! Open 'data_profile_report.html' in your web browser to analyze data types.
Current working directory: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Automating Data Quality Measurement
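
The `df.info()` output above shows Age and Salary as `float64` (pandas stores integer columns with missing values as floats by default) and LastLogin as `object` rather than a date. One possible way to tighten these types at load time is sketched below; it assumes the example rows from Task 1, and the choice of the nullable `Int64` dtype is an illustration, not something the original notebook prescribes.

```python
import io

import pandas as pd

# Example rows from Task 1, embedded so the sketch does not depend on data.csv.
SAMPLE_CSV = """ID,Name,Age,City,Salary,HasChildren,LastLogin
1,Alice,30,New York,60000,True,2024-01-15
2,Bob,24,London,45000,False,2024-02-20
3,Charlie,35,Paris,75000,True,2024-03-10
4,David,,Berlin,50000,False,2024-04-01
5,Eve,29,New York,80000,True,2024-05-20
6,Frank,40,,65000,False,2024-01-05
7,Grace,22,London,,True,2024-02-12
"""

# Default inference: integer columns with gaps become float64.
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Explicit types: nullable integers keep Age/Salary integral despite the gaps,
# and parse_dates turns LastLogin into a datetime column.
typed = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    dtype={"Age": "Int64", "Salary": "Int64"},
    parse_dates=["LastLogin"],
)

# Side-by-side view of inferred versus explicit dtypes.
print(pd.DataFrame({"inferred": raw.dtypes.astype(str), "explicit": typed.dtypes.astype(str)}))
```

Profiling `typed` instead of `raw` should let the report treat LastLogin as a date variable rather than as text.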





### Task 4: Detect Unique Values and Duplicates
**Description**: Use Pandas-Profiling to detect unique values and duplicates in your dataset.

In [None]:
# Write your code from here
import pandas as pd
from ydata_profiling import ProfileReport # Make sure this is ydata_profiling
import os

# --- Configuration ---
csv_file_path = 'data.csv' # Ensure this file exists
output_report_html = 'data_profile_report.html'

# --- Step 1: Load a CSV File ---
try:
    df = pd.read_csv(csv_file_path)
    print(f"Successfully loaded '{csv_file_path}' into a DataFrame.")
    print("DataFrame Head:")
    print(df.head())
    print("\nDataFrame Info:")
    df.info()
    print("-" * 50)

except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print(f"Please ensure '{csv_file_path}' is in the same directory as this script, or provide its full path.")
    raise SystemExit(1)

except pd.errors.EmptyDataError:
    print(f"Error: The file '{csv_file_path}' is empty.")
    raise SystemExit(1)

except Exception as e:
    print(f"An unexpected error occurred while loading the CSV: {e}")
    raise SystemExit(1)

# --- Step 2: Generate a Profile Report to Detect Unique Values and Duplicates ---
print(f"Generating YData-Profiling report for '{csv_file_path}'...")

# Duplicate-row detection and distinct-value counts are part of the standard report;
# explorative=True requests additional, more detailed analysis on top of that.
profile = ProfileReport(df, title="Data Profiling Report", explorative=True)

# Save the report to an HTML file
profile.to_file(output_report_html)

print(f"Report generated successfully! Open '{output_report_html}' in your web browser to analyze unique values and duplicates.")
print(f"Current working directory: {os.getcwd()}")

Successfully loaded 'data.csv' into a DataFrame.
DataFrame Head:
   ID     Name   Age      City   Salary  HasChildren   LastLogin
0   1    Alice  30.0  New York  60000.0         True  2024-01-15
1   2      Bob  24.0    London  45000.0        False  2024-02-20
2   3  Charlie  35.0     Paris  75000.0         True  2024-03-10
3   4    David   NaN    Berlin  50000.0        False  2024-04-01
4   5      Eve  29.0  New York  80000.0         True  2024-05-20

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           7 non-null      int64  
 1   Name         7 non-null      object 
 2   Age          6 non-null      float64
 3   City         6 non-null      object 
 4   Salary       6 non-null      float64
 5   HasChildren  7 non-null      bool   
 6   LastLogin    7 non-null      object 
dtypes: bool(1), float64(2), int64(1), object(3)

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'putmask: first argument must be an array')
Summarize dataset: 100%|██████████| 26/26 [00:01<00:00, 23.91it/s, Completed]                 
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.37s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  4.61it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 463.41it/s]

Report generated successfully! Open 'data_profile_report.html' in your web browser to analyze unique values and duplicates.
Current working directory: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Automating Data Quality Measurement
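
The distinct-value and duplicate figures in the report can be reproduced with plain pandas as a sanity check. The sketch below assumes the example rows from Task 1. Because those rows contain no duplicates, it appends a copy of the first row purely for illustration.

```python
import io

import pandas as pd

# Example rows from Task 1, embedded so the sketch does not depend on data.csv.
SAMPLE_CSV = """ID,Name,Age,City,Salary,HasChildren,LastLogin
1,Alice,30,New York,60000,True,2024-01-15
2,Bob,24,London,45000,False,2024-02-20
3,Charlie,35,Paris,75000,True,2024-03-10
4,David,,Berlin,50000,False,2024-04-01
5,Eve,29,New York,80000,True,2024-05-20
6,Frank,40,,65000,False,2024-01-05
7,Grace,22,London,,True,2024-02-12
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Distinct values per column (missing values are not counted).
distinct_counts = df.nunique()
print(distinct_counts)

# Fully duplicated rows in the data as loaded.
duplicate_rows = int(df.duplicated().sum())
print(f"Duplicate rows: {duplicate_rows}")

# Illustration only: append a copy of the first row to create one duplicate.
with_duplicate = pd.concat([df, df.iloc[[0]]], ignore_index=True)
duplicates_after = int(with_duplicate.duplicated().sum())
print(f"Duplicate rows after appending a copy: {duplicates_after}")

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = with_duplicate.drop_duplicates()
```

A column such as ID, where the distinct count equals the row count, is a candidate key, while low-cardinality columns such as City or HasChildren behave like categories.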



