# **DATA PREPROCESSING**

**Step 1: Importing the libraries**

In [24]:
import pandas as pd
import numpy as np


**Step 2: Importing dataset**

In [25]:
df = pd.read_csv('https://raw.githubusercontent.com/Jaiprakash91194/Assessments/main/Task7/Data.csv')
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [26]:
df.shape

(10, 4)

In [27]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,9.0,38.777778,7.693793,27.0,35.0,38.0,44.0,50.0
Salary,9.0,63777.777778,12265.579662,48000.0,54000.0,61000.0,72000.0,83000.0


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes


**Step 3: Handling the missing data**

In [29]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

The Age and Salary columns each contain a single NaN value.

In [30]:
df[['Age', 'Salary']] = df[['Age', 'Salary']].fillna(df[['Age', 'Salary']].mean())
# Alternative: df = df.fillna(df.median(numeric_only=True)) fills every numeric column with its median

In [31]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
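The mean imputation above could also be done with scikit-learn's `SimpleImputer`, which stores the fitted means so they can be reused on new data. This is a minimal sketch on the first five rows shown by `df.head()`, not the notebook's own method; `sample`, `num_cols`, and `imputer` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# First five rows as printed by df.head(); NaN marks the missing salary.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
})

num_cols = ["Age", "Salary"]

# strategy="mean" replaces each NaN with its column mean;
# "median" is another built-in strategy.
imputer = SimpleImputer(strategy="mean")
sample[num_cols] = imputer.fit_transform(sample[num_cols])

print(sample)
print(imputer.statistics_)  # the per-column means learned during fit
```

Because this sample has only five rows, the filled salary is the mean of those four known values rather than the full-dataset mean used in the notebook.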


**Step 4: Encoding categorical data**

**Counting the distinct values in each column of the dataframe**

The code below is adapted from Stack Overflow:
[To extract distinct values for all categorical columns in dataframe](https://stackoverflow.com/questions/59951043/to-extract-distinct-values-for-all-categorical-columns-in-dataframe)

In [32]:

df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
df_unique


Unnamed: 0,Variable,DistinctCount
0,Country,3
1,Age,10
2,Salary,10
3,Purchased,2


The Country and Purchased columns contain 3 and 2 distinct categories, respectively.

First, we encode the target variable 'Purchased' as 0 (No) and 1 (Yes).

In [33]:
df['Purchased'] = df['Purchased'].eq('Yes').mul(1)

In [34]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,0
1,Spain,27.0,48000.0,1
2,Germany,30.0,54000.0,0
3,Spain,38.0,61000.0,0
4,Germany,40.0,63777.777778,1
5,France,35.0,58000.0,1
6,Spain,38.777778,52000.0,0
7,France,48.0,79000.0,1
8,Germany,50.0,83000.0,0
9,France,37.0,67000.0,1
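The same Yes/No encoding can also be written with an explicit mapping, which some readers find easier to audit than `eq('Yes').mul(1)`. The following is only a sketch on the first five `Purchased` values shown by `df.head()`; the names `purchased`, `mapping`, and `encoded` are illustrative and not part of the notebook.

```python
import pandas as pd

# First five Purchased values as printed by df.head() earlier in the notebook.
purchased = pd.Series(["No", "Yes", "No", "No", "Yes"], name="Purchased")

# Explicit label-to-integer mapping (illustrative name).
mapping = {"No": 0, "Yes": 1}
encoded = purchased.map(mapping)

# Any label missing from the mapping would become NaN, so a quick check
# catches typos or unexpected categories.
assert encoded.notna().all()

print(encoded.tolist())
```

Unlike the comparison-based version, which silently turns every label other than 'Yes' into 0, the mapping surfaces unexpected labels as NaN.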


Next, we one-hot encode the feature variable Country by creating one dummy variable per category.

**Step 5: Creating a dummy variable**

In [35]:
b = df.columns.tolist()  # list of column names
print(b)

['Country', 'Age', 'Salary', 'Purchased']


In [36]:
# dtype=int keeps the dummies as 0/1; pandas >= 2.0 returns booleans by default
df = pd.get_dummies(df, columns=['Country'], dtype=int)

In [37]:
b = df.columns.tolist()  # column names after encoding
print(b)

['Age', 'Salary', 'Purchased', 'Country_France', 'Country_Germany', 'Country_Spain']


In [38]:
df

Unnamed: 0,Age,Salary,Purchased,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,0,1,0,0
1,27.0,48000.0,1,0,0,1
2,30.0,54000.0,0,0,1,0
3,38.0,61000.0,0,0,0,1
4,40.0,63777.777778,1,0,1,0
5,35.0,58000.0,1,1,0,0
6,38.777778,52000.0,0,0,0,1
7,48.0,79000.0,1,1,0,0
8,50.0,83000.0,0,0,1,0
9,37.0,67000.0,1,1,0,0
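The three Country columns always sum to 1, so one of them is redundant. For linear models this is sometimes avoided by dropping one category. The sketch below shows the `drop_first=True` option of `pd.get_dummies` on the first five rows of the data; it is an optional variation, and `sample` and `encoded` are illustrative names.

```python
import pandas as pd

# Country and Age for the first five rows of the dataset.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0],
})

# drop_first=True removes the first category in sorted order (France here),
# which then becomes the implicit baseline: both remaining dummies are 0.
encoded = pd.get_dummies(sample, columns=["Country"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether to drop a category depends on the model that follows; tree-based models, for example, are generally unaffected by the redundant column.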


**Step 6: Splitting the dataset into training and test sets**

In [39]:
b

['Age',
 'Salary',
 'Purchased',
 'Country_France',
 'Country_Germany',
 'Country_Spain']

In [40]:
b.remove('Purchased')

In [41]:
b

['Age', 'Salary', 'Country_France', 'Country_Germany', 'Country_Spain']

In [42]:
X = df[b].values  # array of features
y = df['Purchased'].values

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
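With only 10 rows, a random 80/20 split can put both test rows in the same class. One option, shown here as a sketch rather than as the notebook's method, is to pass `stratify=y` so that each split keeps the class proportions. The arrays are copied from the encoded table above; the `_s` suffix marks the illustrative stratified variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Features in the order Age, Salary, Country_France, Country_Germany, Country_Spain,
# copied from the encoded dataframe printed above.
X = np.array([
    [44.0, 72000.0, 1, 0, 0],
    [27.0, 48000.0, 0, 0, 1],
    [30.0, 54000.0, 0, 1, 0],
    [38.0, 61000.0, 0, 0, 1],
    [40.0, 63777.777778, 0, 1, 0],
    [35.0, 58000.0, 1, 0, 0],
    [38.777778, 52000.0, 0, 0, 1],
    [48.0, 79000.0, 1, 0, 0],
    [50.0, 83000.0, 0, 1, 0],
    [37.0, 67000.0, 1, 0, 0],
])
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])  # encoded Purchased column

# stratify=y keeps the 50/50 class balance in both splits.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train_s.shape, X_test_s.shape)
print(sorted(y_test_s.tolist()))
```

The split sizes match the notebook's (8 training rows, 2 test rows); only the choice of which rows land in the test set may differ.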

**Step 7: Feature Scaling**

In [44]:
from sklearn.preprocessing import StandardScaler  # standard scaling
scaler = StandardScaler()  # create the scaler
scaler.fit(X_train)  # learn the mean and standard deviation from the training data only
X_train_scaled = scaler.transform(X_train)  # transform the training data with those statistics
X_test_scaled = scaler.transform(X_test)  # transform the test data with the same statistics

In [45]:
print(X_train_scaled)

[[ 0.26306757  0.12381479 -1.          2.64575131 -0.77459667]
 [-0.25350148  0.46175632  1.         -0.37796447 -0.77459667]
 [-1.97539832 -1.53093341 -1.         -0.37796447  1.29099445]
 [ 0.05261351 -1.11141978 -1.         -0.37796447  1.29099445]
 [ 1.64058505  1.7202972   1.         -0.37796447 -0.77459667]
 [-0.0813118  -0.16751412 -1.         -0.37796447  1.29099445]
 [ 0.95182631  0.98614835  1.         -0.37796447 -0.77459667]
 [-0.59788085 -0.48214934  1.         -0.37796447 -0.77459667]]


In [46]:
print(X_test_scaled)

[[-1.45882927 -0.90166297 -1.          2.64575131 -0.77459667]
 [ 1.98496442  2.13981082 -1.          2.64575131 -0.77459667]]
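To make the scaling step concrete: `StandardScaler` computes z = (x - mean) / std per column, using the mean and population standard deviation of the data it was fitted on. The sketch below checks this by hand on a few Age and Salary rows from the dataset; `X_fit`, `X_new`, and the `manual_*` arrays are illustrative names, not variables from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary for rows 0-3 of the dataset (none of them imputed).
X_fit = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])
X_new = np.array([[35.0, 58000.0]])  # row 5, treated here as unseen data

scaler = StandardScaler().fit(X_fit)

# Manual standardization with the statistics of the fitted data only.
mean = X_fit.mean(axis=0)
std = X_fit.std(axis=0)  # NumPy default ddof=0, i.e. population standard deviation
manual_fit = (X_fit - mean) / std
manual_new = (X_new - mean) / std

print(np.allclose(scaler.transform(X_fit), manual_fit))
print(np.allclose(scaler.transform(X_new), manual_new))
```

The second check illustrates why the test set is transformed with the training statistics instead of being fitted separately: new rows are expressed on the same scale the model will be trained on.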


**Thank you**