# Lecture 4: Preprocessing

## Course: Machine Learning 
## Instructor: Dr. Shazia Saqib
## Student: Mubashir Malik
## Email: mubshirmalik414@gmail.com

---



### What is Pandas?

**Pandas** is a powerful and easy-to-use open-source Python library used for data manipulation, analysis, and cleaning.  
It provides two main data structures:  
- **Series**: One-dimensional labeled array  
- **DataFrame**: Two-dimensional labeled data structure (like an Excel sheet)
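
To make the two structures concrete, here is a minimal sketch. The names and numbers are illustrative values chosen for this example, not loaded from any file.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label
ages = pd.Series([24, 40, 42], index=["Linda", "Sara", "Rachel"], name="Age")

# A DataFrame: a table whose columns are Series sharing one row index
employees = pd.DataFrame(
    {
        "Name": ["Linda", "Sara", "Rachel"],
        "Age": [24, 40, 42],
        "Salary": [87372.00, 68565.37, 43901.70],
    }
)

print(ages["Sara"])             # look up a Series value by its label
print(employees.shape)          # (number of rows, number of columns)
print(type(employees["Age"]))   # selecting one column returns a Series
```

Selecting a single column of a DataFrame gives back a Series, which is why most column-level methods (`fillna`, `median`, `mode`) used later in these notes are Series methods.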

---

### Why is Pandas Important?

-  **Efficient Data Handling**: Easily load, filter, sort, and transform data.
-  **Data Analysis**: Helps in exploring and understanding patterns in the data.
-  **Preprocessing**: Useful for handling missing values, duplicates, encoding, etc.
-  **Supports Multiple Formats**: Read/write data from CSV, Excel, SQL, JSON, and more.
-  **Fast & Flexible**: Built on top of NumPy, making it optimized for performance.

Pandas is a core library in any Data Science or Machine Learning project.


In [17]:
import pandas as pd

 **Here are a few commonly used methods to load datasets into your own notebook and perform data preprocessing in Python**

##  Load the Dataset

In [29]:
# Load the dataset in csv format
df = pd.read_csv("sample_employee_dataset_1000.csv")

In [19]:
# Display how many rows and columns are in the dataset
df.shape

(1000, 8)

In [30]:
# Display the first 5 rows of the dataset
df.head()

| | ID | Name | Age | Gender | Salary | Joining Date | Is Active | Comments |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Linda | 24 | Female | 87372.0 | 2021-03-21 | True | Team Player |
| 1 | 2 | Sara | 40 | Male | 68565.37 | 2018-05-20 | True | Fast Learner |
| 2 | 3 | Sara | 43 | Male | 100076.38 | 2023-12-03 | False | Fast Learner |
| 3 | 4 | Rachel | 42 | Female | 43901.7 | 2016-11-14 | False | Creative |
| 4 | 5 | Rachel | 44 | Female | 76021.57 | 2018-09-13 | False | Team Player |


In [21]:
# Display the last 5 rows of the dataset
df.tail()

| | ID | Name | Age | Gender | Salary | Joining Date | Is Active | Comments |
|---|---|---|---|---|---|---|---|---|
| 995 | 996 | Tom | 27 | Male | 79282.25 | 2018-09-29 | False | Dedicated |
| 996 | 997 | Paul | 26 | Male | 33355.94 | 2017-10-13 | True | Creative |
| 997 | 998 | Bob | 53 | Female | 71727.28 | 2019-01-14 | True | Creative |
| 998 | 999 | Paul | 47 | Female | 47286.63 | 2015-05-15 | False | Reliable |
| 999 | 1000 | Sara | 60 | Female | 77410.69 | 2023-06-08 | False | Fast Learner |


In [22]:
# Display the basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            1000 non-null   int64  
 1   Name          1000 non-null   object 
 2   Age           1000 non-null   int64  
 3   Gender        1000 non-null   object 
 4   Salary        1000 non-null   float64
 5   Joining Date  1000 non-null   object 
 6   Is Active     1000 non-null   bool   
 7   Comments      1000 non-null   object 
dtypes: bool(1), float64(1), int64(2), object(4)
memory usage: 55.8+ KB


In [23]:
df.dtypes

ID                int64
Name             object
Age               int64
Gender           object
Salary          float64
Joining Date     object
Is Active          bool
Comments         object
dtype: object

In [24]:
# To check for missing values in the dataset
df.isnull().sum()

ID              0
Name            0
Age             0
Gender          0
Salary          0
Joining Date    0
Is Active       0
Comments        0
dtype: int64

In [25]:
# Handling missing values
# If Age or Salary is missing, fill with the median (middle) value
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# If Name or Gender is missing, fill with the mode (most frequent value)
df["Name"] = df["Name"].fillna(df["Name"].mode()[0])
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# If Is Active is missing, fill with the mode (most frequent value)
df["Is Active"] = df["Is Active"].fillna(df["Is Active"].mode()[0])

# If Joining Date is missing, fill with the earliest (min) date
df["Joining Date"] = df["Joining Date"].fillna(df["Joining Date"].min())

# If Comments is missing, fill with "Unknown"
df["Comments"] = df["Comments"].fillna("Unknown")


print("Missing values have been filled.")
# The cleaned dataset is saved to a new CSV file in a later cell


Missing values have been filled.
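
Because this dataset has no missing values, the fills above leave it unchanged. The sketch below applies the same strategy (median for numbers, mode for categories, a placeholder for text) to a small made-up frame that does contain gaps. The `toy` DataFrame and its values are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps (the lecture dataset itself has none)
toy = pd.DataFrame(
    {
        "Age": [24, np.nan, 43, 42, 44],
        "Gender": ["Female", "Male", None, "Female", "Female"],
        "Comments": ["Team Player", None, "Creative", "Reliable", None],
    }
)

print(toy.isnull().sum())  # missing values per column before filling

# Numeric column: fill with the median, which is less sensitive to extreme values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the mode (most frequent value)
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Free-text column: fill with a placeholder string
toy["Comments"] = toy["Comments"].fillna("Unknown")

print(toy)
```

Running `toy.isnull().sum()` again afterwards is a quick way to confirm that no gaps remain.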


In [26]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])  #Encode gender column
df['Is Active'] = le.fit_transform(df['Is Active'])  #Encode is active column

print(df.head()) # shows the first 5 rows of the dataset

   ID    Name  Age  Gender     Salary Joining Date  Is Active      Comments
0   1   Linda   24       0   87372.00   2021-03-21          1   Team Player
1   2    Sara   40       1   68565.37   2018-05-20          1  Fast Learner
2   3    Sara   43       1  100076.38   2023-12-03          0  Fast Learner
3   4  Rachel   42       0   43901.70   2016-11-14          0      Creative
4   5  Rachel   44       0   76021.57   2018-09-13          0   Team Player


In [27]:
# Now applying one hot encoding to the categorical columns
# Apply pd.get_dummies for one hot encoding
encoded_df = pd.get_dummies(df['Is Active'], prefix='Is Active')

# concatenate with original dataset and drop old column
df = pd.concat([df, encoded_df], axis=1).drop(columns=['Is Active'])

df.to_csv("sample_employee_dataset_1000_cleaned.csv", index=False) # Save updated dataset 

print (" Is Active converted into one hot encoding and saved! ")

 Is Active converted into one hot encoding and saved! 


In [28]:
df.head(10)

| | ID | Name | Age | Gender | Salary | Joining Date | Comments | Is Active_0 | Is Active_1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Linda | 24 | 0 | 87372.0 | 2021-03-21 | Team Player | False | True |
| 1 | 2 | Sara | 40 | 1 | 68565.37 | 2018-05-20 | Fast Learner | False | True |
| 2 | 3 | Sara | 43 | 1 | 100076.38 | 2023-12-03 | Fast Learner | True | False |
| 3 | 4 | Rachel | 42 | 0 | 43901.7 | 2016-11-14 | Creative | True | False |
| 4 | 5 | Rachel | 44 | 0 | 76021.57 | 2018-09-13 | Team Player | True | False |
| 5 | 6 | Eve | 50 | 0 | 99939.68 | 2024-11-01 | Fast Learner | False | True |
| 6 | 7 | Tom | 50 | 1 | 102299.24 | 2016-09-08 | Fast Learner | True | False |
| 7 | 8 | Eve | 37 | 1 | 51058.17 | 2023-01-13 | Reliable | True | False |
| 8 | 9 | Eve | 26 | 0 | 52352.11 | 2024-10-07 | Creative | False | True |
| 9 | 10 | John | 35 | 0 | 113214.11 | 2018-08-14 | Hardworking | False | True |


# Lecture 5 

In [2]:
# import necessary libraries
import pandas as pd
from scipy import stats

In [3]:
# Ignore all warnings so they are not displayed during execution
# This is commonly used to suppress non-critical warnings in Jupyter notebooks or scripts
from warnings import filterwarnings
filterwarnings('ignore')

# Easy Guide to Outlier Detection and Data Preparation

Outliers are extreme values that can distort data analysis and negatively affect machine learning models. Identifying and addressing them is crucial in data preprocessing.

In this notebook, we’ll focus on:

1. Cleaning data by removing duplicates and handling missing values.

2. Detecting outliers using visualizations and statistical methods.

3. Handling outliers by removal or adjustment.

4. Scaling and normalizing data for consistency.

## Step 1: Load the Dataset

 **Here are a few commonly used methods to load datasets into your own notebook and perform data preprocessing in Python**

### 1. Load from CSV file

In [4]:
# Load the dataset 
df = pd.read_csv("sample_employee_dataset_1000.csv")
print("Dataset loaded successfully!")

Dataset loaded successfully!


###  2. Load from Excel file

In [6]:
data = pd.read_excel("sample_employee_data_1000.xlsx") # Load the dataset in excel format
print("Dataset loaded successfully!")

Dataset loaded successfully!


### 3. Load from online URL

In [None]:
url = 'https://example.com/path_to_file.csv'
df = pd.read_csv(url)
print("Dataset loaded successfully!")

### 4. Load from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/MyDrive/your_file.csv')
print("Dataset loaded successfully!")

### Purpose of `df.head()` in Pandas

- **Preview Data**: Quickly view the structure and layout of the dataset.
- **Check Columns**: Inspect column names, their order, and get an idea of data types.
- **Identify Issues**: Spot missing values, formatting problems, or inconsistent data.
- **Debugging**: Verify if the dataset has been loaded correctly.


In [36]:
df.head(10) # Display the first 10 rows of the dataset

| | ID | Name | Age | Gender | Salary | Joining Date | Is Active | Comments |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Linda | 24 | Female | 87372.0 | 2021-03-21 | True | Team Player |
| 1 | 2 | Sara | 40 | Male | 68565.37 | 2018-05-20 | True | Fast Learner |
| 2 | 3 | Sara | 43 | Male | 100076.38 | 2023-12-03 | False | Fast Learner |
| 3 | 4 | Rachel | 42 | Female | 43901.7 | 2016-11-14 | False | Creative |
| 4 | 5 | Rachel | 44 | Female | 76021.57 | 2018-09-13 | False | Team Player |
| 5 | 6 | Eve | 50 | Female | 99939.68 | 2024-11-01 | True | Fast Learner |
| 6 | 7 | Tom | 50 | Male | 102299.24 | 2016-09-08 | False | Fast Learner |
| 7 | 8 | Eve | 37 | Male | 51058.17 | 2023-01-13 | False | Reliable |
| 8 | 9 | Eve | 26 | Female | 52352.11 | 2024-10-07 | True | Creative |
| 9 | 10 | John | 35 | Female | 113214.11 | 2018-08-14 | True | Hardworking |


### Purpose of `df.dtypes` in Pandas

- **Check Data Types**: Displays the data type of each column in the DataFrame.
- **Data Understanding**: Helps understand what kind of data each column holds (e.g., `int64`, `float64`, `object`).
- **Preprocessing Help**: Useful in deciding whether encoding, conversion, or scaling is needed.
- **Debugging**: Detect incorrect data types that might affect analysis or model training.


In [39]:
df.dtypes # Display the data types of each column

ID                int64
Name             object
Age               int64
Gender           object
Salary          float64
Joining Date     object
Is Active          bool
Comments         object
dtype: object
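
The output shows `Joining Date` stored as `object`, meaning the dates are plain text. As a minimal sketch of the kind of conversion that `df.dtypes` helps you spot, the example below parses such a column with `pd.to_datetime`. The `toy` frame and the derived `Joining Year` column are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame mimicking the 'Joining Date' column stored as text
toy = pd.DataFrame({"Joining Date": ["2021-03-21", "2018-05-20", "2023-12-03"]})
print(toy.dtypes)  # a text dtype before conversion

# Parse the strings into real timestamps
toy["Joining Date"] = pd.to_datetime(toy["Joining Date"], format="%Y-%m-%d")
print(toy.dtypes)  # a datetime64 dtype after conversion

# Date parts are now available through the .dt accessor
toy["Joining Year"] = toy["Joining Date"].dt.year
print(toy)
```

Once the column is a datetime, it can be sorted chronologically or turned into numeric features such as the year, instead of being treated as an arbitrary string.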

## Step 2: Remove Duplicate Rows

### Why Do We Remove Duplicate Rows in a Dataset?

- **Ensure Data Quality**: Duplicate rows can introduce bias and reduce the accuracy of analysis or model performance.
- **Avoid Redundancy**: Reduces repetition in data, making it cleaner and easier to work with.
- **Improve Performance**: Smaller, duplicate-free datasets are faster to process and analyze.
- **Accurate Results**: Prevents skewed results in statistical analysis or machine learning training.
- **Maintain Uniqueness**: In cases like user IDs or transaction records, duplicates can cause serious logic issues.


In [40]:
# Remove the duplicate rows
df_cleaned = df.drop_duplicates()

# Display the cleaned Datasets
df_cleaned.head()

| | ID | Name | Age | Gender | Salary | Joining Date | Is Active | Comments |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Linda | 24 | Female | 87372.0 | 2021-03-21 | True | Team Player |
| 1 | 2 | Sara | 40 | Male | 68565.37 | 2018-05-20 | True | Fast Learner |
| 2 | 3 | Sara | 43 | Male | 100076.38 | 2023-12-03 | False | Fast Learner |
| 3 | 4 | Rachel | 42 | Female | 43901.7 | 2016-11-14 | False | Creative |
| 4 | 5 | Rachel | 44 | Female | 76021.57 | 2018-09-13 | False | Team Player |
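
The first five rows alone do not show whether anything was removed. A minimal sketch on a made-up frame shows how to count duplicates before dropping them, and how to treat rows as duplicates based on a key column only. The `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical
toy = pd.DataFrame(
    {
        "ID": [1, 2, 2, 3],
        "Name": ["Linda", "Sara", "Sara", "Rachel"],
        "Age": [24, 40, 40, 42],
    }
)

# Count fully identical rows (the first occurrence is not counted)
print(toy.duplicated().sum())

# Drop exact duplicates; .copy() gives an independent DataFrame to modify later
deduped = toy.drop_duplicates().copy()

# Treat rows as duplicates when only the 'ID' column matches
deduped_by_id = toy.drop_duplicates(subset=["ID"], keep="first")

print(len(toy), len(deduped), len(deduped_by_id))
```

Comparing `len(df)` with `len(df_cleaned)` in the notebook would confirm in the same way how many rows `drop_duplicates()` actually removed.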


## Step 3:  Handle Missing Values

### Why Do We Handle Missing Values in a Dataset?

- **Improve Model Accuracy**: Many machine learning algorithms cannot handle missing data; it can reduce performance or cause errors.
- **Prevent Bias**: Ignoring missing values may lead to biased results or incorrect conclusions.
- **Ensure Consistency**: Cleaning missing values helps maintain a consistent dataset for analysis.
- **Avoid Runtime Errors**: Some tools or functions may break if missing values are not handled properly.
- **Better Decision Making**: A complete dataset provides a clearer picture for drawing insights and making predictions.


In [41]:
#Check for missing values
df_cleaned.isnull().sum()

# This dataset currently has no missing values

ID              0
Name            0
Age             0
Gender          0
Salary          0
Joining Date    0
Is Active       0
Comments        0
dtype: int64

In [42]:
# If there are missing values, fill them with the column mean
# Assign the result back to the column instead of calling fillna(..., inplace=True) on it:
# the chained inplace form triggers warnings in recent pandas versions and may not update df_cleaned
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].mean())
df_cleaned['Salary'] = df_cleaned['Salary'].fillna(df_cleaned['Salary'].mean())

# Now check again for missing values
df_cleaned.isnull().sum()

ID              0
Name            0
Age             0
Gender          0
Salary          0
Joining Date    0
Is Active       0
Comments        0
dtype: int64
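
An alternative to calling `fillna` column by column is scikit-learn's `SimpleImputer`, which learns the fill values once and can reuse them on new data. This is a minimal sketch on hypothetical toy values, not part of the lecture's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each numeric column
toy = pd.DataFrame(
    {
        "Age": [24.0, np.nan, 43.0, 42.0],
        "Salary": [87372.0, 68565.37, np.nan, 43901.7],
    }
)

# strategy="mean" replaces each gap with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)    # the learned per-column means
print(filled.isnull().sum())  # no missing values remain
```

Fitting the imputer on the training split and calling `transform` on the test split keeps information from the test rows out of the fill values.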

## Step 4: Normalize and Scale the Data

In data preprocessing, **Normalization** and **Scaling** are key techniques to ensure that numerical features are on a similar scale, which can improve the performance of machine learning models, particularly those that rely on distance metrics or assume normally distributed data.

- **Normalization** transforms features to fall within a specific range, often between 0 and 1. This ensures that all features contribute equally to the model’s performance. We’ll use `MinMaxScaler` to perform this transformation on our data.

- **Scaling** refers to adjusting the values of numeric columns so that they fit within a particular range or distribution. A common method is to scale the data so that it has a **mean of 0** and a **standard deviation of 1** (this is called **standardization**). In other cases, scaling can bring the data into a specific range, like **0 to 1** (this is often done with `MinMaxScaler`).

In this step, we’ll apply `MinMaxScaler` to scale the data between 0 and 1, bringing all numerical features to a common range to improve model efficiency and accuracy.


In [43]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
scaler = MinMaxScaler()

# Apply Min-Max Scaling to 'Age' and 'Salary'
df_cleaned[['Age','Salary']] = scaler.fit_transform(df_cleaned[['Age','Salary']])

# Show Scaled data
df_cleaned[['Age','Salary']].head()

| | Age | Salary |
|---|---|---|
| 0 | 0.052632 | 0.637778 |
| 1 | 0.473684 | 0.428458 |
| 2 | 0.552632 | 0.779179 |
| 3 | 0.526316 | 0.153948 |
| 4 | 0.578947 | 0.511446 |
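
Min-max scaling computes `(x - min) / (max - min)` for each column. The sketch below checks that formula against `MinMaxScaler` on a small hypothetical `Age` column. The four ages are made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column
toy = pd.DataFrame({"Age": [22, 24, 40, 60]})

# Manual formula: (x - min) / (max - min)
age_min, age_max = toy["Age"].min(), toy["Age"].max()
manual = (toy["Age"] - age_min) / (age_max - age_min)

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(toy[["Age"]]).ravel()

print(manual.round(6).tolist())
print(scaled.round(6).tolist())
print(np.allclose(manual, scaled))  # both approaches agree
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses all the other values into a narrow part of the range.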


## Scaling

In [44]:
# This scales the values to have mean = 0 and standard deviation = 1 (standardization).
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Apply standardization
# Note: 'Age' and 'Salary' were already min-max scaled above. Both transforms are linear,
# so the result is the same as standardizing the raw columns directly.
df_cleaned[['Age','Salary']] = scaler.fit_transform(df_cleaned[['Age','Salary']])

# Show standardized data
df_cleaned[['Age','Salary']].head()

| | Age | Salary |
|---|---|---|
| 0 | -1.564514 | 0.428272 |
| 1 | -0.121735 | -0.294284 |
| 2 | 0.148787 | 0.916378 |
| 3 | 0.058613 | -1.241869 |
| 4 | 0.23896 | -0.007814 |
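
Standardization computes `(x - mean) / std` for each column. The sketch below checks the formula against `StandardScaler` on a few hypothetical salaries. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column
toy = pd.DataFrame({"Salary": [43901.70, 68565.37, 76021.57, 87372.00, 100076.38]})

# Manual formula: (x - mean) / std, using the population std (ddof=0)
manual = (toy["Salary"] - toy["Salary"].mean()) / toy["Salary"].std(ddof=0)

# The same transformation using scikit-learn
standardized = StandardScaler().fit_transform(toy[["Salary"]]).ravel()

print(np.allclose(manual, standardized))  # both approaches agree
print(round(standardized.mean(), 6), round(standardized.std(), 6))
```

Unlike min-max scaling, the standardized values are not confined to a fixed range, which is why some of the values in the output above are negative.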


## Step 5: Encode Categorical Variables (Optional)

In machine learning, most algorithms require **numerical inputs** to perform computations. Categorical variables, such as 'Gender', 'Color', or 'Country', are typically non-numeric and need to be converted into a format that machine learning models can process. This process is known as **encoding**.

In this step, we’ll use **Label Encoding** to convert categorical variables into numeric form. Label Encoding assigns a unique integer to each category in a feature. For example, the ‘Gender’ column with values ‘Female’ and ‘Male’ is encoded as 0 and 1, respectively, because `LabelEncoder` numbers the categories in sorted order. This gives the model a numeric column it can process.

- **Label Encoding**: This method is appropriate when the categorical variable has an inherent order (ordinal) or only two categories. For nominal variables with more than two categories, the integer codes can introduce an unintended hierarchy. In that case, consider alternatives like **One-Hot Encoding**, which avoids the issue by representing each category as a separate binary feature.

In this case, we’ll apply Label Encoding to transform the 'Gender' variable into numerical format.

In [45]:
from sklearn.preprocessing import LabelEncoder

# Encode gender column
le = LabelEncoder()
df_cleaned['Gender_Encoded'] = le.fit_transform(df_cleaned['Gender'])

# Display Encoded gender column
df_cleaned[['Gender','Gender_Encoded']].head()

| | Gender | Gender_Encoded |
|---|---|---|
| 0 | Female | 0 |
| 1 | Male | 1 |
| 2 | Male | 1 |
| 3 | Female | 0 |
| 4 | Female | 0 |
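
`LabelEncoder` numbers categories in sorted order, which does not necessarily match their meaning. For a truly ordinal feature, one option is scikit-learn's `OrdinalEncoder` with an explicit category order. The sketch below uses a hypothetical `Rating` column that is not part of the lecture dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column (not in the lecture dataset)
toy = pd.DataFrame({"Rating": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Medium=2
le = LabelEncoder()
toy["Rating_Label"] = le.fit_transform(toy["Rating"])
print(list(le.classes_))

# OrdinalEncoder with an explicit order: Low=0, Medium=1, High=2
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
toy["Rating_Ordinal"] = oe.fit_transform(toy[["Rating"]]).ravel().astype(int)

print(toy)
```

With the alphabetical codes, "High" would rank below "Low", so the explicit order is the safer choice whenever the order carries meaning.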


## One-Hot Encoding

In [46]:
import pandas as pd
# Apply one hot encoding to all categorical columns
df_cleaned = pd.get_dummies(df_cleaned, drop_first=True)

# Display updated dataFrame
print(df_cleaned.head())

   ID       Age    Salary  Is Active  Gender_Encoded  Name_Bob  Name_Eve  \
0   1 -1.564514  0.428272       True               0     False     False   
1   2 -0.121735 -0.294284       True               1     False     False   
2   3  0.148787  0.916378      False               1     False     False   
3   4  0.058613 -1.241869      False               0     False     False   
4   5  0.238960 -0.007814      False               0     False     False   

   Name_John  Name_Linda  Name_Paul  ...  Joining Date_2024-12-12  \
0      False        True      False  ...                    False   
1      False       False      False  ...                    False   
2      False       False      False  ...                    False   
3      False       False      False  ...                    False   
4      False       False      False  ...                    False   

   Joining Date_2024-12-17  Joining Date_2024-12-23  Joining Date_2024-12-28  \
0                    False                    Fa
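
The output above shows that calling `pd.get_dummies` on the whole DataFrame also expands high-cardinality text columns such as `Name` and `Joining Date` into many indicator columns. A minimal sketch of restricting the encoding with the `columns` parameter follows. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Name": ["Linda", "Sara", "Rachel", "Tom"],
        "Gender": ["Female", "Male", "Female", "Male"],
        "Comments": ["Creative", "Reliable", "Creative", "Dedicated"],
    }
)

# Without `columns`, every text column is expanded, including 'Name'
all_encoded = pd.get_dummies(toy, drop_first=True)

# With `columns`, only the chosen low-cardinality columns are encoded
selected = pd.get_dummies(toy, columns=["Gender", "Comments"], drop_first=True)

print(all_encoded.shape, selected.shape)
print(selected.columns.tolist())
```

Identifier-like columns such as names usually carry no general pattern for a model, so leaving them out of the encoding (or dropping them) keeps the feature matrix small.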

## Step 6: Detect Outliers Using Z-Score Method

The **Z-score method** helps us find outliers by measuring how far each value is from the average (mean) in terms of standard deviations. A Z-score tells us how many standard deviations a value is away from the mean.

- If the absolute value of a Z-score is greater than **3** (more than three standard deviations above or below the mean), the value is commonly treated as an **outlier**.

In this step, we’ll use the Z-score method to identify which data points are unusually far from the average and might be outliers.

In [47]:
# Import required libraries
from scipy.stats import zscore
import numpy as np

# Calculate Z-scores for 'Age' and 'Salary'
z_scores = np.abs(zscore(df_cleaned[['Age', 'Salary']]))

# Identify rows with Z-scores > 3 (outliers)
outliers = (z_scores > 3).any(axis=1)

# Remove the outliers from the dataset
df_no_outliers = df_cleaned[~outliers]

# Display cleaned dataset without outliers
df_no_outliers.head()

| | ID | Age | Salary | Is Active | Gender_Encoded | Name_Bob | Name_Eve | Name_John | Name_Linda | Name_Paul | ... | Joining Date_2024-12-12 | Joining Date_2024-12-17 | Joining Date_2024-12-23 | Joining Date_2024-12-28 | Comments_Dedicated | Comments_Fast Learner | Comments_Hardworking | Comments_Needs Improvement | Comments_Reliable | Comments_Team Player |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -1.564514 | 0.428272 | True | 0 | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | True |
| 1 | 2 | -0.121735 | -0.294284 | True | 1 | False | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |
| 2 | 3 | 0.148787 | 0.916378 | False | 1 | False | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |
| 3 | 4 | 0.058613 | -1.241869 | False | 0 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 5 | 0.23896 | -0.007814 | False | 0 | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
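
To see the Z-score filter actually remove something, here is a minimal sketch on hypothetical salaries containing one extreme entry. The values are made up for illustration. Note that in very small samples (10 values or fewer) no Z-score can exceed 3, so the threshold only becomes useful with more data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical salaries: 20 typical values plus one extreme entry
salaries = [48000 + 500 * i for i in range(20)] + [500000]
toy = pd.DataFrame({"Salary": salaries})

# Absolute Z-score: distance from the mean in standard deviations
z = np.abs(zscore(toy["Salary"]))

# Flag rows that lie more than 3 standard deviations from the mean
is_outlier = z > 3

print(toy[is_outlier])                    # the flagged row
print(len(toy), len(toy[~is_outlier]))    # row counts before and after removal
```

Comparing the row counts before and after filtering, as in the last line, is a simple check that can also be applied to `df_cleaned` and `df_no_outliers` in the notebook.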


# Final Thoughts
- Remove duplicates
- Handle missing values
- Normalize and scale numerical data
- Detect and remove outliers using the Z-score method
- Encode categorical variables for machine learning

These preprocessing steps are essential for ensuring clean and reliable data for analysis or modeling!

# Lecture 6: Linear Regression

In [2]:
# import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [13]:
# Load the dataset
df = pd.read_csv("sample_employee_dataset_1000.csv")
print("Dataset loaded successfully!")

Dataset loaded successfully!


In [14]:
# Display dataset information if loaded
print("\nDataset preview:")
print(df.head())
print("\nColumn Names:")
print(df.columns)



Dataset preview:
   ID    Name  Age  Gender     Salary Joining Date  Is Active      Comments
0   1   Linda   24  Female   87372.00   2021-03-21       True   Team Player
1   2    Sara   40    Male   68565.37   2018-05-20       True  Fast Learner
2   3    Sara   43    Male  100076.38   2023-12-03      False  Fast Learner
3   4  Rachel   42  Female   43901.70   2016-11-14      False      Creative
4   5  Rachel   44  Female   76021.57   2018-09-13      False   Team Player

Column Names:
Index(['ID', 'Name', 'Age', 'Gender', 'Salary', 'Joining Date', 'Is Active',
       'Comments'],
      dtype='object')


In [15]:
# Set the column names explicitly (here they match the original names, so nothing changes)
df.columns = ['ID', 'Name', 'Age', 'Gender', 'Salary', 'Joining Date', 'Is Active', 'Comments']
print("\nRenamed Columns:")
print(df.columns)


Renamed Columns:
Index(['ID', 'Name', 'Age', 'Gender', 'Salary', 'Joining Date', 'Is Active',
       'Comments'],
      dtype='object')


In [16]:
# Select numeric columns for regression
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'ID' in numeric_columns:
    numeric_columns.remove('ID')  # ID is an identifier, not a useful feature for regression
print("\nNumeric Columns:")
print(numeric_columns)


Numeric Columns:
['Age', 'Salary']


In [17]:
# Define features X and target y variable
X = df[['Age']] #Dataframe
Y = df['Salary'] #Series 

In [18]:
# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [19]:
#create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, Y_train)

# prediction on test data
Y_predict = model.predict(X_test)
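
The notebook predicts on the test set but does not score the predictions. A minimal sketch of how they could be evaluated with `mean_squared_error` and `r2_score` from `sklearn.metrics` follows. It runs on synthetic stand-in data generated in the block (an assumption made so the example is self-contained), not on the lecture dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the lecture dataset): salary loosely tied to age
rng = np.random.default_rng(42)
age = rng.integers(22, 61, size=200)
salary = 30000 + 1000 * age + rng.normal(0, 5000, size=200)
X = pd.DataFrame({"Age": age})
Y = pd.Series(salary, name="Salary")

# Same split and model setup as in the cells above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, Y_train)
Y_predict = model.predict(X_test)

# Root mean squared error: typical prediction error, in salary units
mse = mean_squared_error(Y_test, Y_predict)
print(f"RMSE: {np.sqrt(mse):.2f}")

# R^2: share of the variance in Y explained by the model (1.0 is a perfect fit)
r2 = r2_score(Y_test, Y_predict)
print(f"R^2: {r2:.3f}")
```

An R^2 close to 0 on the real dataset would indicate that `Age` alone explains little of the variation in `Salary`.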



In [20]:
#Display model coefficients and intercept
print("\nModel Coefficients:")
print(f"Intercept: {model.intercept_}")
print(f"Slope: {model.coef_[0]}")



Model Coefficients:
Intercept: 80672.07825685601
Slope: -89.06807843594666


In [21]:
# predict for new data
new_data = pd.DataFrame({'Age': [25,30, 35, 40,45]}) # Create a DataFrame for new data
new_predictions = model.predict(new_data) # Predict using the model

print("\nNew Predictions:")
for age, prediction in zip(new_data['Age'], new_predictions):
    print(f"Age: {age}, Predicted Salary: {prediction:.2f}")



New Predictions:
Age: 25, Predicted Salary: 78445.38
Age: 30, Predicted Salary: 78000.04
Age: 35, Predicted Salary: 77554.70
Age: 40, Predicted Salary: 77109.36
Age: 45, Predicted Salary: 76664.01
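
These predictions are just the fitted line evaluated at each age: `salary = intercept + slope * age`. The sketch below reproduces them by hand from the intercept and slope printed earlier. The helper name `predict_salary` is hypothetical.

```python
# Coefficients copied from the model output printed above
intercept = 80672.07825685601
slope = -89.06807843594666


def predict_salary(age):
    """Evaluate the fitted line: salary = intercept + slope * age."""
    return intercept + slope * age


for age in [25, 30, 35, 40, 45]:
    print(f"Age: {age}, Predicted Salary: {predict_salary(age):.2f}")
```

The slope of about -89 means the model predicts roughly 89 less in salary for each additional year of age, a very small change relative to salaries in the tens of thousands.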
