
@Author:Nagashree C R<br>
@Date:30-09-2024<br>
@Last modified By:Nagashree C R<br>
@Last Modified Date:30-09-2024<br>
@Title: Applied different data preprocessing steps :<br>
a. Handling missing data <br>
b. Handling categorical data <br>
c. Split the dataset into training set and test set <br>
d. Feature scaling <br>


<h3><b>Task 1: Classify the problems as supervised, unsupervised, reinforcement, or semi-supervised</b></h3>

**Answers and Reasons:**

<h4><b>a.</b> Spam filtering: Is an email spam or not</h4>
<p>Answer: Supervised Learning</p>
<p>Reason: Classifies emails based on labeled examples of spam and non-spam.</p>

<h4><b>b.</b> Given a list of customers and information about them, discover groups of similar users.</h4>
<p>Answer: Unsupervised Learning</p>
<p>Reason: Identifies patterns in data without labeled outcomes.</p>

<h4><b>c.</b> Robotics: A robot is in a maze, and it needs to find a way out.</h4>
<p>Answer: Reinforcement Learning</p>
<p>Reason: Learns to navigate through feedback based on its actions.</p>

<h4><b>d.</b> Training an AI for a complex game such as Civilization or Dota</h4>
<p>Answer: Reinforcement Learning</p>
<p>Reason: Learns strategies through trial and error while playing the game.</p>

<h4><b>e.</b> Anomaly detection: Given measurements from sensors in a manufacturing facility, identify anomalies.</h4>
<p>Answer: Unsupervised Learning (or Semi-Supervised)</p>
<p>Reason: Finds unusual patterns without labeled examples.</p>

<h4><b>f.</b> Discover patterns in data such as whenever it rains, people tend to stay indoors.</h4>
<p>Answer: Unsupervised Learning</p>
<p>Reason: Identifies relationships in data without specific labels.</p>

<h4><b>g.</b> Given information about a house, predict its price</h4>
<p>Answer: Supervised Learning</p>
<p>Reason: Uses labeled data (features and prices) for prediction.</p>

<h4><b>h.</b> Netflix: Given a user and a movie, predict the rating the user is going to give to the movie</h4>
<p>Answer: Supervised Learning</p>
<p>Reason: Uses historical data to predict future ratings based on features.</p>

<h4><b>i.</b> Given an image, output which objects are present in the image</h4>
<p>Answer: Supervised Learning</p>
<p>Reason: Requires labeled images to train a model for object identification.</p>
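
The practical difference between the supervised and unsupervised answers above is whether labels are passed to the learner. A minimal sketch with scikit-learn follows; the toy arrays and the choice of `LogisticRegression` and `KMeans` are illustrative assumptions, not part of the task.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, available only in the supervised case

# Supervised: the model is fitted on features AND labels
clf = LogisticRegression().fit(X, y)
predicted = clf.predict([[0.1, 0.1], [5.1, 5.0]])

# Unsupervised: the model sees only the features and must find the groups itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print(predicted)
print(clusters)
```

The cluster ids returned by `KMeans` are arbitrary (group one may be called 0 or 1), which is one visible consequence of learning without labels.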


<h3><b>Task 2: Data Preprocessing Steps</b></h3>

In [1]:
import sklearn
print(sklearn.__version__)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline



1.5.2


In [22]:
# Replace the URL with the appropriate path to the CSV file
url = r"C:\Users\ASHAY\OneDrive\Desktop\bridgelabz\final_ML\ML_Programs\Machine_Learning\preprocessing\dataPreprocessing.csv"
data = pd.read_csv(url)

In [3]:
print(data)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [4]:
data.shape   # 10 rows and 4 columns

(10, 4)

<h2><b>b</b>. Handling categorical data</h2>
Separate the independent variables (features, x) from the dependent variable (target, y) in the dataset.

In [5]:
x=data[['Country','Age','Salary']].values   # Independent variables
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [6]:
y=data[['Purchased']].values
y

array([['No'],
       ['Yes'],
       ['No'],
       ['No'],
       ['Yes'],
       ['Yes'],
       ['No'],
       ['Yes'],
       ['No'],
       ['Yes']], dtype=object)

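<h2><b>a</b>. Handling missing data</h2>
Missing values must be handled before training because most scikit-learn estimators do not accept NaN inputs. Here the missing Age and Salary values are replaced with the mean of their column.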


In [7]:
import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN in the Age and Salary columns (indices 1 and 2) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
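
The imputed values in the output above (38.777... for Age and 63777.777... for Salary) are simply the means of the observed entries in each column. A small NumPy-only sketch of the same idea, with the two columns typed in by hand from the printed dataset:

```python
import numpy as np

# Age and Salary columns from the dataset above (np.nan marks the missing entries)
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan, 58000,
                   52000, 79000, 83000, 67000], dtype=float)

# Mean over the observed (non-NaN) values only
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace each NaN with its column mean
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

Here `age_mean` is 349 / 9 and `salary_mean` is 574000 / 9, which is what `SimpleImputer(strategy='mean')` fills in.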

Converting the categorical data into numerical form
    Label-encode the Country column as integer codes (France = 0, Germany = 1, Spain = 2)

In [8]:
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:,0] = label_encoder_x.fit_transform(x[:,0])
x


array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
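
To see which integer stands for which country, the fitted encoder can be inspected. This is a minimal sketch on a hand-typed subset of the Country column; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the categories in sorted order; the position of each is its code
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Note that the codes imply an ordering (Spain > Germany > France) that has no meaning for countries, which is the usual motivation for one-hot encoding such a column instead.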

One-hot encoding for the Country column (the result is only displayed here; x keeps the integer codes from the previous step)

In [9]:
from sklearn.preprocessing import OneHotEncoder
onehotEncoder = OneHotEncoder()
onehotEncoder.fit_transform(data.Country.values.reshape(-1,1)).toarray()

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
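
One possible way to put the one-hot columns next to Age and Salary in a single feature matrix is `ColumnTransformer`, which is imported at the top of this notebook. The sketch below uses three hand-typed rows from the dataset; the step name `"country"` is an arbitrary label chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A few rows copied from the dataset above (no missing values in this subset)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
})

# One-hot encode Country; pass Age and Salary through unchanged.
# sparse_threshold=0 asks for a dense NumPy array as the result.
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)
features = ct.fit_transform(df)
print(features)
```

The encoded columns come first (France, Germany, Spain), followed by the passed-through Age and Salary columns.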

Label encoding for the Purchased column (No = 0, Yes = 1)


In [10]:
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y.ravel())  # flatten the (10, 1) column to 1-D, as LabelEncoder expects

y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])


<h2><b>c.</b> Split the dataset into training set and test set</h2>

In [11]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train

array([[1, 40.0, 63777.77777777778],
       [0, 37.0, 67000.0],
       [2, 27.0, 48000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [2, 38.0, 61000.0],
       [0, 44.0, 72000.0],
       [0, 35.0, 58000.0]], dtype=object)

In [12]:
x_test

array([[1, 30.0, 54000.0],
       [1, 50.0, 83000.0]], dtype=object)

In [13]:
y_test


array([0, 0])

In [14]:
y_train

array([1, 1, 1, 0, 1, 0, 0, 1])
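
With only ten rows, a purely random split can leave the test set with a single class, as happened above (`y_test` is `[0, 0]`). One option, sketched here as an assumption rather than as part of the original exercise, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded Purchased labels from the output above
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature: the row index

# stratify=y keeps the class proportions (5 "No" / 5 "Yes") in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(sorted(y_te.tolist()))
```

With five rows per class and a two-row test set, the stratified split places one row of each class in the test set.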

<h2><b>d</b>. Feature scaling</h2>


In [15]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_train

array([[ 0.13483997,  0.26306757,  0.12381479],
       [-0.94387981, -0.25350148,  0.46175632],
       [ 1.21355975, -1.97539832, -1.53093341],
       [ 1.21355975,  0.05261351, -1.11141978],
       [-0.94387981,  1.64058505,  1.7202972 ],
       [ 1.21355975, -0.0813118 , -0.16751412],
       [-0.94387981,  0.95182631,  0.98614835],
       [-0.94387981, -0.59788085, -0.48214934]])

In [16]:
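x_test = sc.transform(x_test)  # reuse the mean and std learned from x_train; do not refit on the test set
x_test

array([[ 0.13483997, -1.45882927, -0.90166297],
       [ 0.13483997,  1.98496442,  2.13981082]])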
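
Standardization rescales each column as z = (x - mean) / std. A NumPy-only sketch for the Age column is given below, with the training and test values typed in from the outputs above. It follows the common convention of computing the statistics on the training rows and reusing them for the test rows.

```python
import numpy as np

# Age values of the training rows and the two test rows shown above
age_train = np.array([40.0, 37.0, 27.0, 349 / 9, 48.0, 38.0, 44.0, 35.0])
age_test = np.array([30.0, 50.0])

# Statistics come from the training data only (np.std defaults to the
# population standard deviation, which is what StandardScaler uses)
mean = age_train.mean()
std = age_train.std()

z_train = (age_train - mean) / std
z_test = (age_test - mean) / std  # test rows reuse the training mean and std

print(np.round(z_train, 4))
print(np.round(z_test, 4))
```

The first entry of `z_train` reproduces the 0.26306757 shown in the scaled `x_train` output. Scaled test values computed this way do not have to average to zero, because the mean being subtracted is the training mean.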

The data preprocessing steps are now complete, and the prepared data is ready for a machine learning algorithm to be trained on it.

<h2><b>e</b>. Handling missing data</h2>

In [17]:

print(data.isnull().sum())
data_cleaned = data.dropna() #Dropping rows with missing values
print(data_cleaned)


Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


Traverse each column and fill its missing values: the mean for numerical columns, the most frequent value (mode) for categorical columns

In [18]:

for column_name in data.columns:

    if data[column_name].dtype in ['int64', 'float64']:  # Numerical columns
        data[column_name] = data[column_name].fillna(data[column_name].mean())

    elif data[column_name].dtype == 'object':  # Categorical columns
        data[column_name] = data[column_name].fillna(data[column_name].mode()[0])

print("\nMissing values after filling:")
print(data.isnull().sum())

# Display the filled DataFrame
print(data)


Missing values after filling:
Country      0
Age          0
Salary       0
Purchased    0
dtype: int64
   Country        Age        Salary Purchased
0   France  44.000000  72000.000000        No
1    Spain  27.000000  48000.000000       Yes
2  Germany  30.000000  54000.000000        No
3    Spain  38.000000  61000.000000        No
4  Germany  40.000000  63777.777778       Yes
5   France  35.000000  58000.000000       Yes
6    Spain  38.777778  52000.000000        No
7   France  48.000000  79000.000000       Yes
8  Germany  50.000000  83000.000000        No
9   France  37.000000  67000.000000       Yes
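
The same fill can be written without an explicit loop by building one dictionary of per-column fill values and passing it to `fillna`. The sketch below is one possible alternative, shown on a small made-up frame rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one gap per column type
df = pd.DataFrame({
    "Country": ["France", "Spain", None, "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
})

numeric_cols = df.select_dtypes(include="number").columns
other_cols = df.columns.difference(numeric_cols)

# Column-wise means for numeric columns, most frequent value for the rest
fill_values = {**df[numeric_cols].mean().to_dict(),
               **df[other_cols].mode().iloc[0].to_dict()}
filled = df.fillna(value=fill_values)
print(filled)
```

Unlike the loop above, this version returns a new DataFrame and leaves `df` unchanged.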


<h2><b>f</b>. Handling categorical data</h2>

In [19]:

categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

# One-hot encoding for categorical columns
data_encoded = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
print("\nEncoded DataFrame:")
print(data_encoded.head())


Encoded DataFrame:
    Age        Salary  Country_Germany  Country_Spain  Purchased_Yes
0  44.0  72000.000000            False          False          False
1  27.0  48000.000000            False           True           True
2  30.0  54000.000000             True          False          False
3  38.0  61000.000000            False           True          False
4  40.0  63777.777778             True          False           True
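
The encoded frame above has no `Country_France` or `Purchased_No` column because `drop_first=True` drops the first category of each encoded column; the dropped category is implied when all remaining indicators are False. A short sketch on a hand-typed Country column shows the effect:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

full = pd.get_dummies(df, columns=["Country"])
reduced = pd.get_dummies(df, columns=["Country"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# France is the dropped baseline: a France row has every remaining indicator off
france_rows = ~reduced.astype(bool).any(axis=1)
print(france_rows.tolist())
```

Dropping one column per category removes the redundancy among the indicator columns, at the cost of making the baseline category implicit.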


<h2><b>f.</b> Split the dataset into training set and test set</h2>

In [20]:
X = data_encoded.drop('Purchased_Yes', axis=1)  # Features
y = data_encoded['Purchased_Yes']  # Target variable

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

# Display the resulting splits
print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:\n", y_train)
print("y_test:\n", y_test)

Training set shape: (8, 4) (8,)
Test set shape: (2, 4) (2,)
X_train:
          Age        Salary  Country_Germany  Country_Spain
5  35.000000  58000.000000            False          False
0  44.000000  72000.000000            False          False
7  48.000000  79000.000000            False          False
2  30.000000  54000.000000             True          False
9  37.000000  67000.000000            False          False
4  40.000000  63777.777778             True          False
3  38.000000  61000.000000            False           True
6  38.777778  52000.000000            False           True
X_test:
     Age   Salary  Country_Germany  Country_Spain
8  50.0  83000.0             True          False
1  27.0  48000.0            False           True
y_train:
 5     True
0    False
7     True
2    False
9     True
4     True
3    False
6    False
Name: Purchased_Yes, dtype: bool
y_test:
 8    False
1     True
Name: Purchased_Yes, dtype: bool


<h2><b>g</b>. Feature scaling</h2>

In [21]:
# Initialize the scaler
scaler = StandardScaler()  # or MinMaxScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data
X_test_scaled = scaler.transform(X_test)

# Display the shapes of the resulting datasets
print("Training set shape:", X_train_scaled.shape, y_train.shape)
print("Test set shape:", X_test_scaled.shape, y_test.shape)

# Optionally, display the first few rows of the scaled training set
print("\nFirst few rows of the scaled training set:")
print(X_train_scaled[:5])  # Displaying the first 5 rows

Training set shape: (8, 4) (8,)
Test set shape: (2, 4) (2,)

First few rows of the scaled training set:
[[-0.7529426  -0.62603778 -0.57735027 -0.57735027]
 [ 1.00845381  1.01304295 -0.57735027 -0.57735027]
 [ 1.79129666  1.83258331 -0.57735027 -0.57735027]
 [-1.73149616 -1.09434656  1.73205081 -0.57735027]
 [-0.36152118  0.42765698 -0.57735027 -0.57735027]]
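
The separate steps above (imputation, encoding, splitting, scaling) can also be combined with the `Pipeline` and `ColumnTransformer` classes imported in the first cell. The following is a sketch under assumptions, not the method used in this notebook: the dataset is typed in by hand, the step names (`"impute"`, `"scale"`, `"num"`, `"cat"`) are arbitrary labels, and the split is done before fitting so that all statistics come from the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The dataset shown at the top of this notebook, typed in by hand
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000,
               52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})
X = data.drop(columns="Purchased")
y = (data["Purchased"] == "Yes").astype(int)

# Split first, so imputation and scaling statistics come from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric, ["Age", "Salary"]),
        # handle_unknown="ignore" encodes categories unseen in training as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ],
    sparse_threshold=0,  # return a dense array
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training rows only
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics
print(X_train_prepared.shape, X_test_prepared.shape)
```

Each prepared row has five columns: scaled Age, scaled Salary, and three one-hot Country indicators. Keeping every step inside one fitted object means the test set is always transformed with the statistics learned from the training set.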
