# Data Preparation in Data Science
Data preparation is the first step after obtaining a dataset. In this step, raw data is pre-processed into a form that can be analyzed easily and accurately.

# Pandas
Pandas is a software library written for Python. It is widely used in the data science community because it offers powerful, expressive, and flexible data structures that make data manipulation and analysis easy.

In [1]:
# Install the library (%pip installs into the environment of the running kernel)
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


# Importing Library
A library must be imported before any of its functions can be used, so importing comes before every other step.

In [1]:
#Importing Pandas in Python
import pandas as pd

# Loading Data
The first step in Data Preparation is loading the dataset into a tool or framework where it can be manipulated and analyzed.

In [2]:
#Loading the Dataset in Pandas
df = pd.read_csv("data.csv", header=0, sep=",")
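The `header=0` and `sep=","` arguments above match the defaults of `pd.read_csv`; they matter when a file deviates from that layout. A minimal sketch, assuming a hypothetical semicolon-separated sample held in memory rather than the actual `data.csv`:

```python
import io
import pandas as pd

# Hypothetical sample standing in for a file on disk
raw_text = "Car;Model;Volume\nSkoda;Citigo;1000\nMini;Cooper;1500\n"

# sep names the delimiter between fields;
# header=0 says the first line holds the column names
sample_df = pd.read_csv(io.StringIO(raw_text), header=0, sep=";")

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Passing the wrong `sep` would not raise an error here; pandas would read each line as a single column, so checking the shape after loading is a cheap safeguard.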

# Data Cleaning
After loading the data, it is essential to inspect it to understand its structure and identify potential issues.

In [4]:
# Inspect the first few rows of the dataset
print("Preview of Dataset:")
print(df.head())

Preview of Dataset:
          Car       Model  Volume  Weight  CO2
0      Toyoty        Aygo    1000     790   99
1  Mitsubishi  Space Star    1200    1160   95
2       Skoda      Citigo    1000     929   95
3        Mini      Cooper    1500    1140  105
4          VW         Up!    1000     929  105


In [5]:
# Check the structure and details of the dataset
# df.info() prints its report itself and returns None, so it is not wrapped in print()
print("\nDataset Information:")
df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Car     31 non-null     object
 1   Model   31 non-null     object
 2   Volume  31 non-null     int64 
 3   Weight  31 non-null     int64 
 4   CO2     31 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 1.3+ KB


# Handling Missing Values
Missing data can cause issues in analysis. Start by identifying and handling missing values.

In [6]:
# Check for missing values in the dataset
print("\nMissing Values in Each Column:")
print(df.isnull().sum())


Missing Values in Each Column:
Car       0
Model     0
Volume    0
Weight    0
CO2       0
dtype: int64


# Removing Rows
`dropna(axis=0)` removes every row of the DataFrame `df` in which at least one value is missing (NaN).

In [7]:
# Remove rows with missing values
df.dropna(axis=0, inplace=True)

# df.info() prints its report itself and returns None, so it is not wrapped in print()
print("\nDataset After Removing Missing Values:")
df.info()


Dataset After Removing Missing Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Car     31 non-null     object
 1   Model   31 non-null     object
 2   Volume  31 non-null     int64 
 3   Weight  31 non-null     int64 
 4   CO2     31 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 1.3+ KB
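Because this dataset contains no missing values, `dropna` leaves it unchanged. The sketch below uses a small made-up DataFrame (not the cars data) to show the effect, with `fillna` as one common alternative when dropping rows would discard too much data. Filling with the column mean is an illustrative choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing entries, for illustration only
demo = pd.DataFrame({
    "Volume": [1000, np.nan, 1500],
    "Weight": [790, 1160, np.nan],
})

# Drop every row that contains at least one NaN
dropped = demo.dropna(axis=0)

# Alternative: fill each NaN with the mean of its own column
filled = demo.fillna(demo.mean())

print(dropped)
print(filled)
```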


# Handling Duplicate Data
Pandas makes it easy to identify and remove duplicate rows from the dataset.

In [8]:
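# Remove duplicate rows in place
# With inplace=True the method returns None, so its result is not assigned to a variable
df.drop_duplicates(inplace=True)
print(df.shape)

(31, 5)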
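To see which rows would be removed before dropping them, `duplicated()` returns a boolean mask. A minimal sketch on made-up rows, also showing the non-in-place form of `drop_duplicates`, which returns a new DataFrame:

```python
import pandas as pd

# Made-up rows; the second row repeats the first
demo = pd.DataFrame({
    "Car": ["Ford", "Ford", "Opel"],
    "Model": ["Fiesta", "Fiesta", "Astra"],
})

# True for every row that repeats an earlier row
dup_mask = demo.duplicated()
print(dup_mask.sum())

# Without inplace=True, drop_duplicates returns a new DataFrame
deduped = demo.drop_duplicates()
print(deduped)
```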


# Data Transformation
1. Ensure all columns have appropriate data types to avoid calculation errors.

In [9]:
# Inspect current data types
print("\nCurrent Data Types:")
print(df.dtypes)


Current Data Types:
Car       object
Model     object
Volume     int64
Weight     int64
CO2        int64
dtype: object


2. Ensure Numeric Columns Have Correct Data Types
Since Volume and Weight are already int64, no changes are needed for them. However, if they were stored as strings (e.g., object), you would need to convert them to integers or floats.

In [10]:
# Example: Convert 'Volume' and 'Weight' to numeric types (if needed)
df["Volume"] = df["Volume"].astype(int)
df["Weight"] = df["Weight"].astype(int)
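`astype(int)` raises an error if a string column contains entries that are not clean integers. When that is a possibility, `pd.to_numeric` with `errors="coerce"` is a more forgiving option. A sketch on a made-up string column (the `"n/a"` entry is hypothetical and does not occur in the cars data):

```python
import pandas as pd

# Made-up column stored as strings, with one unparseable entry
volume_text = pd.Series(["1000", "1200", "n/a"], dtype=object)

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
volume_num = pd.to_numeric(volume_text, errors="coerce")

print(volume_num)
```

The resulting NaN values can then be handled with the missing-value techniques from the earlier section.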

3. Verify Data Type Consistency for Categorical Columns
Categorical columns such as Car and Model should be treated as object or category. Converting them to category can save memory and improve performance in analysis.

In [11]:
# Convert 'Car' and 'Model' columns to category type
df["Car"] = df["Car"].astype("category")
df["Model"] = df["Model"].astype("category")
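The memory saving can be checked with `memory_usage(deep=True)`. The sketch below uses a made-up column with a few values repeated many times, which is the case where `category` helps most. For a column whose values are nearly all distinct, the saving may be small or absent.

```python
import pandas as pd

# Made-up column: three distinct values repeated many times
brands = pd.Series(["Ford", "Skoda", "Opel"] * 1000, dtype=object)

# deep=True counts the bytes of the strings themselves, not just the pointers
object_bytes = brands.memory_usage(deep=True)
category_bytes = brands.astype("category").memory_usage(deep=True)

print(object_bytes, category_bytes)
```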

4. Check Data Types After Fixing
Finally, confirm that all columns have the correct data types.

In [12]:
# Verify updated data types
print("\nUpdated Data Types:")
print(df.dtypes)


Updated Data Types:
Car       category
Model     category
Volume       int32
Weight       int32
CO2          int64
dtype: object


# Filtering and Selecting Data
Pandas allows you to filter and select rows or columns based on specific conditions.

In [13]:
# Filter rows where the 'Car' column is 'Ford'
df_filtered = df[df['Car'] == 'Ford']  
print(df_filtered)

     Car   Model  Volume  Weight  CO2
7   Ford  Fiesta    1500    1112   98
11  Ford  Fiesta    1000    1112   99
16  Ford   Focus    2000    1328  105
17  Ford  Mondeo    1600    1584   94
28  Ford   B-Max    1600    1235  104
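Conditions can also be combined, and columns can be selected in the same step. A minimal sketch on a few rows copied from the outputs above; the thresholds are arbitrary and chosen only for illustration:

```python
import pandas as pd

# A few rows copied from the outputs above
cars = pd.DataFrame({
    "Car": ["Ford", "Ford", "Ford", "Mini"],
    "Model": ["Fiesta", "Focus", "Mondeo", "Cooper"],
    "Volume": [1500, 2000, 1600, 1500],
    "CO2": [98, 105, 94, 105],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
ford_large = cars[(cars["Car"] == "Ford") & (cars["Volume"] >= 1600)]

# .loc selects rows by condition and columns by name in one step
low_co2 = cars.loc[cars["CO2"] < 100, ["Car", "Model"]]

print(ford_large)
print(low_co2)
```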


# Dropping unnecessary columns
Dropping columns that are irrelevant or redundant streamlines the dataset, making it more focused and easier to analyze.

In [14]:
# Dropping the 'CO2' column from the dataframe
df.drop(columns='CO2', inplace=True)

# Print the modified dataframe
print(df)

           Car       Model  Volume  Weight
0       Toyoty        Aygo    1000     790
1   Mitsubishi  Space Star    1200    1160
2        Skoda      Citigo    1000     929
3         Mini      Cooper    1500    1140
4           VW         Up!    1000     929
5        Skoda       Fabia    1400    1109
6     Mercedes     A-Class    1500    1365
7         Ford      Fiesta    1500    1112
8         Audi          A1    1600    1150
9      Hyundai         I20    1100     980
10      Suzuki       Swift    1300     990
11        Ford      Fiesta    1000    1112
12       Honda       Civic    1600    1252
13      Hundai         I30    1600    1326
14        Opel       Astra    1600    1330
15       Skoda       Rapid    1600    1119
16        Ford       Focus    2000    1328
17        Ford      Mondeo    1600    1584
18        Opel    Insignia    2000    1428
19    Mercedes     C-Class    2100    1365
20       Skoda     Octavia    1600    1415
21       Volvo         S60    2000    1415
22    Merce

# Renaming Columns (Optional)
You can rename the columns to make them shorter, more consistent, or easier to read.

In [15]:
# Rename columns for consistency
df.rename(columns={"Car": "CarName", "Model": "CarModel", "Volume": "EngineVolume", "Weight": "CarWeight"}, inplace=True)

print("\nRenamed Columns:")
print(df.head())


Renamed Columns:
      CarName    CarModel  EngineVolume  CarWeight
0      Toyoty        Aygo          1000        790
1  Mitsubishi  Space Star          1200       1160
2       Skoda      Citigo          1000        929
3        Mini      Cooper          1500       1140
4          VW         Up!          1000        929


# Analyzing Cleaned Data
After cleaning, summarize the data and analyze it.

In [16]:
# Summarize numerical data
print("\nSummary of Cleaned Dataset:")
print(df.describe())


Summary of Cleaned Dataset:
       EngineVolume    CarWeight
count     31.000000    31.000000
mean    1603.225806  1287.645161
std      378.139190   236.874024
min     1000.000000   790.000000
25%     1450.000000  1115.500000
50%     1600.000000  1328.000000
75%     2000.000000  1421.500000
max     2500.000000  1746.000000


# Checking Unique Values in Categorical Columns
Inspecting the unique values in categorical columns helps identify distinct categories, ensure data consistency, and detect anomalies such as inconsistent spellings within the dataset.

In [17]:
print("\nUnique Values in a Column:")
print(df["CarModel"].unique())  


Unique Values in a Column:
['Aygo', 'Space Star', 'Citigo', 'Cooper', 'Up!', ..., 'E-Class', 'XC70', 'B-Max', 'Zafira', 'SLK']
Length: 30
Categories (30, object): ['A-Class', 'A1', 'A4', 'A6', ..., 'Up!', 'V70', 'XC70', 'Zafira']
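Inconsistent spellings are one kind of anomaly this check can surface. The previews above show `Toyoty` and `Hundai` in the car-name column. Assuming these are misspellings of Toyota and Hyundai (the source data does not confirm this), one way to correct them is `replace` with a mapping. The sketch uses a plain string Series of made-up rows; a column already converted to `category` may behave differently.

```python
import pandas as pd

names = pd.Series(["Toyoty", "Hundai", "Hyundai", "Skoda"])

# Assumed corrections; verify against the data source before applying
corrections = {"Toyoty": "Toyota", "Hundai": "Hyundai"}
cleaned = names.replace(corrections)

print(cleaned.unique())
```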


# Checking Data Distribution
An optional step involves visualizing the distribution of numerical data or analyzing the frequency of categorical variables.

In [18]:
# Count frequency of unique values in 'CarName'
print("\nFrequency of Car Name:")
print(df["CarName"].value_counts())


Frequency of Car Name:
CarName
Ford          5
Mercedes      5
Skoda         4
Audi          3
Opel          3
Volvo         3
Honda         1
Hundai        1
Hyundai       1
Mini          1
Mitsubishi    1
Suzuki        1
Toyoty        1
VW            1
Name: count, dtype: int64
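For numerical columns, a quick text-only view of the distribution is to count values per interval. A sketch using `value_counts(bins=...)` on the first ten engine volumes from the full listing shown earlier; the choice of two bins is arbitrary:

```python
import pandas as pd

# First ten engine volumes from the listing shown earlier
volumes = pd.Series([1000, 1200, 1000, 1500, 1000, 1400, 1500, 1500, 1600, 1100])

# bins=2 splits the value range into two equal-width intervals;
# sort=False keeps the intervals in ascending order instead of sorting by count
binned = volumes.value_counts(bins=2, sort=False)

print(binned)
```

A plotting library would give a graphical histogram of the same information, but the binned counts are often enough for a first look.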


# Saving to a New CSV File
To save the modified DataFrame into a new CSV file, you can use the to_csv() method in pandas.

In [65]:
# Saving the DataFrame to a new CSV file
df.to_csv('modified_data.csv', index=False)

print("Data saved to 'modified_data.csv'")

Data saved to 'modified_data.csv'
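A quick way to confirm the file was written correctly is to read it back and compare. The sketch below uses a made-up two-row DataFrame and a temporary directory so that no real file is touched. It also illustrates that a CSV file stores only text: a `category` column does not come back as `category` and would need converting again after loading.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with one category column, for illustration only
demo = pd.DataFrame({"CarName": ["Skoda", "Mini"], "EngineVolume": [1000, 1500]})
demo["CarName"] = demo["CarName"].astype("category")

# Write to a temporary directory so no real file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modified_data.csv")
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same shape and values, but the category dtype is not preserved
print(reloaded.shape == demo.shape)
print(reloaded.dtypes)
```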
