## Hands-on Lab: Generative AI for Data Preparation
### Estimated time: 30 minutes
In practice, the first part of a data science workflow is cleaning and preparing the data for analysis. This usually involves removing blank entries, normalizing numerical attributes, encoding categorical variables as numbers, and so on. In this lab, you will use a generative AI model to create Python code that performs these tasks on a real-world data set.

## Learning objectives
In this lab, you will learn how to use generative AI to create Python code that can:

- Handle missing values in the data set
- Correct the data type of the required data set attributes
- Perform standardization and normalization on the required parameters
- Convert categorical data into numerical indicator variables

In [1]:
import pandas as pd

# Specify the file path
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"

# Read the CSV file into a Pandas data frame
df = pd.read_csv(file_path)

print(df)

# Assuming the first row of the file contains the headers, you don't need to specify any additional parameters.

# Additional details:
# - The `pd.read_csv()` function is used to read a CSV file into a Pandas data frame.
# - By default, it assumes that the first row of the file contains the headers for the data.
# - If your file doesn't have headers, you can specify `header=None` as an additional parameter.
# - You can also specify other parameters, such as `sep` to specify the delimiter used in the file.
# - Make sure you have the Pandas library installed in your Python environment before running this code.

     Unnamed: 0 Manufacturer  Category     Screen  GPU  OS  CPU_core  \
0             0         Acer         4  IPS Panel    2   1         5   
1             1         Dell         3    Full HD    1   1         3   
2             2         Dell         3    Full HD    1   1         7   
3             3         Dell         4  IPS Panel    2   1         5   
4             4           HP         4    Full HD    2   1         7   
..          ...          ...       ...        ...  ...  ..       ...   
233         233       Lenovo         4  IPS Panel    2   1         7   
234         234      Toshiba         3    Full HD    2   1         5   
235         235       Lenovo         4  IPS Panel    2   1         5   
236         236       Lenovo         3    Full HD    3   1         5   
237         237      Toshiba         3    Full HD    2   1         5   

     Screen_Size_cm  CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_kg  Price  
0            35.560            1.6       8             2
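The comments in the cell above mention `header=None` and `sep`. As a minimal sketch of how those two parameters behave, the example below reads a small, made-up, semicolon-delimited text with no header row. The sample text and the column names passed to `names` are assumptions for illustration and are not part of the lab's data set.

```python
import io

import pandas as pd

# Hypothetical in-memory file: no header row, semicolon-delimited.
raw_text = "Acer;35.56;978\nDell;39.624;634\n"

# header=None tells pandas that the first row is data, not column names.
# sep=";" sets the delimiter, and names= supplies the column labels ourselves.
sample = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    sep=";",
    names=["Manufacturer", "Screen_Size_cm", "Price"],
)

print(sample)
print(sample.shape)
```

Without `header=None`, pandas would treat the `Acer` row as the header and the frame would contain only one data row.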

In [5]:
# Leftover download code for a different data set (Spacex.csv); not used in this lab
# filepath ='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_2/data/Spacex.csv'
# df_original = pd.read_csv(filepath, header=0)
# Save a local copy of the laptop pricing data frame
file_path = r"C:\Users\Raegan\Downloads\GenerativeAI_DataPreperation_LaptopPricing.csv"  # rename the file as needed
try:
    df.to_csv(file_path, index=False)
    print(f'DataFrame saved to {file_path}')
except Exception as e:
    print(f'An error occurred: {e}')

DataFrame saved to C:\Users\Raegan\Downloads\GenerativeAI_DataPreperation_LaptopPricing.csv


In [25]:
import pandas as pd

# Identify columns with missing values
columns_with_missing_values = df.columns[df.isnull().any()]
missing_values_count = df.isnull().sum()[columns_with_missing_values]

df.columns
print(df.dtypes)
print(columns_with_missing_values)
missing_values_info = pd.DataFrame({'Column': columns_with_missing_values, 'Missing Values Count': missing_values_count})
print(missing_values_info)

# Additional details:
# - `df.isnull()` returns a Boolean data frame where each cell is True if it holds a missing value (NaN) and False otherwise.
# - `.any()` reduces that frame to a Boolean Series with one entry per column, True if the column has at least one missing value.
# - Indexing `df.columns` with that Boolean Series keeps only the labels of the columns that contain missing values.
# - `df.isnull().sum()` counts the missing values in each column; indexing it with those labels keeps only the affected columns.

# You can now use the 'columns_with_missing_values' variable to further analyze or handle the columns with missing values.

Unnamed: 0          int64
Manufacturer       object
Category            int64
Screen             object
GPU                 int64
OS                  int64
CPU_core            int64
Screen_Size_cm    float64
CPU_frequency     float64
RAM_GB              int64
Storage_GB_SSD      int64
Weight_kg         float64
Price               int64
dtype: object
Index(['Screen_Size_cm', 'Weight_kg'], dtype='object')
                        Column  Missing Values Count
Screen_Size_cm  Screen_Size_cm                     4
Weight_kg            Weight_kg                     5
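The summary printed above repeats each column name twice, once as the index and once in the `Column` field. One possible way to get the same counts without the duplication is sketched below on a small, made-up frame. The `toy` data and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the laptop data set.
toy = pd.DataFrame({
    "Screen_Size_cm": [35.56, np.nan, 39.624, np.nan],
    "Weight_kg": [1.6, 2.2, np.nan, 1.9],
    "Price": [978, 634, 946, 1244],
})

# Count missing values per column, then keep only columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Turn the Series into a one-column frame; the column labels stay in the index.
missing_summary = missing_counts.to_frame(name="Missing Values Count")
print(missing_summary)
```

Here `Price` drops out of the summary because it has no missing entries.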


In [29]:
# Replace missing values in the 'Screen_Size_cm' column with the most frequent value
most_frequent_value = df['Screen_Size_cm'].mode()[0]
# df['Screen_Size_cm'].fillna(most_frequent_value, inplace=True)  # avoided: inplace=True on a selected column triggers a FutureWarning in recent pandas versions
df['Screen_Size_cm'] = df['Screen_Size_cm'].fillna(most_frequent_value)

# Replace missing values in the 'Weight_kg' column with the mean value
mean_value = df['Weight_kg'].mean()
# df['Weight_kg'].fillna(mean_value, inplace=True)  # avoided for the same reason as above
df['Weight_kg'] = df['Weight_kg'].fillna(mean_value)

# Additional details:
# - The `.mode()` method is used to calculate the most frequent value in a column.
# - The `[0]` indexing is used to retrieve the most frequent value from the resulting Series.
# - The `.fillna()` method is used to replace missing values with a specified value.
# - The result of `.fillna()` is assigned back to the column; this replaces the older `inplace=True` pattern, which is deprecated for column selections.
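To see the two imputation strategies in isolation, the following sketch applies them to a four-row frame. The `toy` values are made up for illustration; only the column names are borrowed from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column.
toy = pd.DataFrame({
    "Screen_Size_cm": [39.624, 35.56, 39.624, np.nan],
    "Weight_kg": [1.0, 2.0, np.nan, 3.0],
})

# .mode() returns a Series because several values can tie for most frequent;
# [0] picks the first (smallest) of them.
most_frequent = toy["Screen_Size_cm"].mode()[0]
toy["Screen_Size_cm"] = toy["Screen_Size_cm"].fillna(most_frequent)

# .mean() skips NaN by default, so the mean is taken over the known weights.
mean_weight = toy["Weight_kg"].mean()
toy["Weight_kg"] = toy["Weight_kg"].fillna(mean_weight)

print(toy)
```

The mode is a common choice for attributes that take a few discrete values, such as screen size, while the mean suits continuous attributes such as weight.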

In [31]:
# Change the data type of 'Screen_Size_cm' and 'Weight_kg' to float (both are already float64 here, so this leaves them unchanged)
df['Screen_Size_cm'] = df['Screen_Size_cm'].astype(float)
df['Weight_kg'] = df['Weight_kg'].astype(float)

# Additional details:
# - The `.astype()` method is used to change the data type of a column.
# - In this case, we're specifying `float` as the desired data type.
# - Make sure the columns contain numeric values that can be converted to float.
# - If there are any non-numeric values in the columns, the conversion will raise an error.
# You can now use the modified 'df' data frame, which has the data types of 'Screen_Size_cm' and 'Weight_kg' changed to float.
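The comments above note that `.astype(float)` raises an error when a column holds non-numeric values. The sketch below reproduces that failure on a made-up Series and shows `pd.to_numeric` with `errors="coerce"` as one possible alternative, which turns unparseable entries into NaN instead of raising. The `raw` values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column read as text, with one entry that is not a number.
raw = pd.Series(["35.56", "39.624", "unknown"])

# astype(float) stops at the first value it cannot convert.
try:
    raw.astype(float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print(f"astype(float) failed: {err}")

# to_numeric with errors="coerce" converts what it can and marks the rest as NaN.
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.dtype)
print(cleaned.isnull().sum())
```

The coerced NaN values can then be handled with the same missing-value steps used earlier in the lab.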

In [33]:
# Normalize the content under 'CPU_frequency' with respect to its maximum value
max_value = df['CPU_frequency'].max()
df['CPU_frequency'] = df['CPU_frequency'] / max_value

# Additional details:
# - The code calculates the maximum value of the 'CPU_frequency' attribute using the `.max()` method.
# - It then divides the values under 'CPU_frequency' by the maximum value to normalize them.
# - The resulting normalized values overwrite the original values in the 'CPU_frequency' attribute.
# You can now use the modified 'df' data frame, which has the content under the 'CPU_frequency' attribute normalized.
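The learning objectives mention standardization as well as normalization, but the cell above only divides by the maximum. As a rough sketch, the example below compares that simple scaling with z-score standardization on a made-up Series. The sample frequencies are assumptions for illustration.

```python
import pandas as pd

# Hypothetical CPU frequencies in GHz.
freq = pd.Series([1.6, 2.0, 2.7, 2.9], name="CPU_frequency")

# Simple scaling: divide by the maximum, so the largest value becomes 1.
max_scaled = freq / freq.max()

# Z-score standardization: subtract the mean and divide by the standard deviation.
# pandas' .std() uses the sample standard deviation (ddof=1) by default.
z_scored = (freq - freq.mean()) / freq.std()

print(max_scaled)
print(z_scored)
```

Simple scaling keeps all values positive and at most 1, while the standardized values are centered on 0 with a standard deviation of 1.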

In [35]:
# Convert the 'Screen' attribute into indicator variables, since it is a categorical (object) column
df1 = pd.get_dummies(df['Screen'], prefix='Screen')

# Append df1 into the original data frame df
df = pd.concat([df, df1], axis=1)

# Drop the original 'Screen' attribute from the data frame
df.drop('Screen', axis=1, inplace=True)

# Additional details:
# - The `pd.get_dummies()` function is used to convert a categorical attribute into indicator variables.
# - The resulting indicator variables are stored in a new data frame named 'df1'.
# - The `prefix` parameter is used to specify the naming convention for the indicator variables.
# - The `pd.concat()` function is used to concatenate the original data frame 'df' and 'df1' along the column axis (axis=1).
# - The resulting concatenated data frame is assigned back to 'df'.
# - Finally, the `.drop()` method is used to drop the original 'Screen' attribute from 'df'.
# You can now use the modified 'df' data frame, which has the 'Screen' attribute converted into indicator variables, appended, and the original attribute dropped.
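The three steps above (create the indicators, concatenate, drop the original) can also be done in a single call by passing the whole frame to `pd.get_dummies` with the `columns` parameter. The sketch below shows this on a made-up frame. The `toy` data is an assumption, and `dtype=int` is an optional choice that produces 0/1 values instead of the Boolean values some pandas versions return by default.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column.
toy = pd.DataFrame({
    "Manufacturer": ["Acer", "Dell", "HP"],
    "Screen": ["IPS Panel", "Full HD", "Full HD"],
})

# columns=["Screen"] encodes only that column, appends the indicators,
# and drops the original column in one step.
encoded = pd.get_dummies(toy, columns=["Screen"], prefix="Screen", dtype=int)

print(encoded)
```

The explicit three-step version used in the lab gives the same columns and can be easier to follow when learning what each call does.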

## Practice

In [38]:
# Create a prompt to generate Python code that converts the values under 'Price' from USD to euros.
# Conversion rate from USD to Euros
conversion_rate = 0.85  # Example conversion rate

# Convert Price from USD to Euros
df['Price_EUR'] = df['Price'] * conversion_rate

# Modify the normalization prompt to perform min-max normalization on the 'CPU_frequency' parameter.

# Perform min-max normalization on the 'CPU_frequency' column
df['CPU_frequency_normalized'] = (df['CPU_frequency'] - df['CPU_frequency'].min()) / (df['CPU_frequency'].max() - df['CPU_frequency'].min())

# Show the updated DataFrame
print(df.head())


   Unnamed: 0 Manufacturer  Category  GPU  OS  CPU_core  Screen_Size_cm  \
0           0         Acer         4    2   1         5          35.560   
1           1         Dell         3    1   1         3          39.624   
2           2         Dell         3    1   1         7          39.624   
3           3         Dell         4    2   1         5          33.782   
4           4           HP         4    2   1         7          39.624   

   CPU_frequency  RAM_GB  Storage_GB_SSD  Weight_kg  Price  Screen_Full HD  \
0       0.551724       8             256       1.60    978           False   
1       0.689655       4             256       2.20    634            True   
2       0.931034       8             256       2.20    946            True   
3       0.551724       8             128       1.22   1244           False   
4       0.620690       8             256       1.91    837            True   

   Screen_IPS Panel  Price_EUR  CPU_frequency_normalized  
0              True  
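In this notebook, `CPU_frequency` had already been divided by its maximum before the min-max step was applied. That does not change the min-max result, because min-max normalization is unaffected by multiplying every value by the same positive constant. The sketch below checks this on a made-up Series. The `min_max` helper and the sample values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical CPU frequencies in GHz.
freq = pd.Series([1.6, 2.0, 2.7, 2.9])


def min_max(values: pd.Series) -> pd.Series:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    return (values - values.min()) / (values.max() - values.min())


# Min-max on the raw values and on the values already divided by their maximum.
from_raw = min_max(freq)
from_scaled = min_max(freq / freq.max())

# The two results agree up to floating-point rounding.
print((from_raw - from_scaled).abs().max())
```

If the column had been standardized or shifted in some other way first, the same check could be used to confirm whether the order of steps matters.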