#### Problem Statement

Social media advertising is now one of the most popular forms of advertising. Advertisers can use users' demographic information to target their ads accordingly. You are given a dataset with the following attributes:

|Field|Description|
|---:|:---|
|User ID|Unique ID|
|Gender|Male or Female|
|Age|Age of a person|
|EstimatedSalary|Salary of a person|
|Purchased|‘0’ or ‘1’. ‘0’ means not purchased and ‘1’ means purchased.|


**Source:** https://www.kaggle.com/rishabhsingh98/social-network-ads

**Citation:** Rishabh Singh. (2020). Social Network Ads.

Implement a kNN classifier to predict whether a user will purchase the product displayed in a social network ad.

---

### List of Activities

**Activity 1:** Import Modules and Read Data

**Activity 2:** Perform Train-Test Split

**Activity 3:** Determine the Optimal Value of $k$

**Activity 4:** Build kNN Classifier Model






---

#### Activity 1: Import Modules and Read Data

Import the necessary Python packages.

Read the data from a CSV file to create a Pandas DataFrame.

**Dataset-->**  social-network-ads.csv

Also, print the first five rows of the dataset. Check for null values and treat them accordingly.


In [None]:
# Import all the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the dataset
df = pd.read_csv("/content/social-network-ads - social-network-ads.csv")


# Print first five rows using head() function
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [None]:
# Check if there are any null values. If any column has null values, treat them accordingly
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [None]:
df.isnull().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

**Q:** Are there any missing or null values in the dataset?

**A:** No.
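
This dataset has no gaps, but if some were present, one common treatment is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below illustrates that idea on a small made-up DataFrame (`toy` and its values are assumptions, not rows from the real data); dropping incomplete rows with `dropna()` would be another reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the real dataset has none.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Female"],
    "Age": [19, 35, np.nan, 27],
    "EstimatedSalary": [19000, np.nan, 43000, 57000],
})

# Count missing values per column before treatment.
missing_before = toy.isnull().sum()

# Fill numeric columns with the median and the categorical column with the mode.
filled = toy.copy()
for col in ["Age", "EstimatedSalary"]:
    filled[col] = filled[col].fillna(filled[col].median())
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

# Count missing values again after treatment.
missing_after = filled.isnull().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```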

---

#### Activity 2: Perform Train-Test Split

In this dataset, `Purchased` is the target variable and all the other columns are feature variables.

Create two separate DataFrames, one containing the feature variables and the other containing the target variable. Also, drop the `User ID` column from the features DataFrame, because it is only an identifier and carries no predictive information.
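
A minimal sketch of this separation, with the identifier column dropped, is shown below. The DataFrame `toy` and its values are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rows with the same column names as the dataset.
toy = pd.DataFrame({
    "User ID": [1001, 1002, 1003],
    "Gender": ["Male", "Male", "Female"],
    "Age": [19, 35, 26],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased": [0, 0, 1],
})

# Features: every column except the target and the identifier.
features = toy.drop(columns=["User ID", "Purchased"])

# Target: a 1-D Series, the shape scikit-learn estimators expect for y.
target = toy["Purchased"]

print(features.columns.tolist())
print(target.shape)
```

Keeping the target as a Series rather than a one-column DataFrame avoids shape warnings when it is later passed to `fit()`.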





In [None]:
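# Separate the feature variables (x) from the target variable (y).
# Note: `User ID` is still included in x here, so it also appears in the
# encoded features used later. It is only an identifier and should be
# dropped, as described above, before training.
x = df[["User ID", "Gender", "Age", "EstimatedSalary"]]
y = df[["Purchased"]]
print(x)
print(y)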


      User ID  Gender  Age  EstimatedSalary
0    15624510    Male   19            19000
1    15810944    Male   35            20000
2    15668575  Female   26            43000
3    15603246  Female   27            57000
4    15804002    Male   19            76000
..        ...     ...  ...              ...
395  15691863  Female   46            41000
396  15706071    Male   51            23000
397  15654296  Female   50            20000
398  15755018    Male   36            33000
399  15594041  Female   49            36000

[400 rows x 4 columns]
     Purchased
0            0
1            0
2            0
3            0
4            0
..         ...
395          1
396          1
397          1
398          0
399          1

[400 rows x 1 columns]


Print the summary of the features DataFrame to determine the data type of each feature variable.

In [None]:
# 'info()' is called on the full DataFrame `df` here, so the target column
# `Purchased` is listed along with the feature variables.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


Convert the categorical `Gender` feature into numerical columns by calling the `get_dummies()` function of the `pandas` module with the features DataFrame as input.




In [None]:
# Use 'get_dummies()' to one-hot encode each categorical column of the features DataFrame.
emc = pd.get_dummies(x)
emc.head()

Unnamed: 0,User ID,Age,EstimatedSalary,Gender_Female,Gender_Male
0,15624510,19,19000,0,1
1,15810944,35,20000,0,1
2,15668575,26,43000,1,0
3,15603246,27,57000,1,0
4,15804002,19,76000,0,1


Split the dataset into a train set and a test set such that the train set contains 70% of the instances and the test set contains the remaining 30%.
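
A minimal sketch of a 70% train / 30% test split follows. The arrays `X_demo` and `y_demo` are randomly generated stand-ins (an assumption, not the ads data), and `stratify` is an optional extra that keeps the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 400 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = rng.integers(0, 2, size=400)

# 'test_size' is the fraction held out for TESTING,
# so 0.3 leaves 70% of the rows for training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (280, 3) (120, 3)
```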

In [None]:
df.head(100)

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0
...,...,...,...,...,...
95,15709441,Female,35,44000,0
96,15710257,Female,35,25000,0
97,15582492,Male,28,123000,1
98,15575694,Male,35,73000,0


In [None]:
# Split the encoded features and the target into train and test sets
# using the 'train_test_split' function.
from sklearn.model_selection import train_test_split

# Note: 'test_size' is the fraction held out for testing. The value 0.7 used
# here gives a 30% train / 70% test split (120 / 280 rows), the reverse of the
# 70% train split described above; 'test_size=0.3' would give 280 / 120.
x_train, x_test, y_train, y_test = train_test_split(emc, y, test_size=0.7, random_state=100)

# Print the shape of the train and test sets.
print("Shape of x_train", x_train.shape)
print("Shape of x_test", x_test.shape)
print("Shape of y_train", y_train.shape)
print("Shape of y_test", y_test.shape)


Shape of x_train (120, 5)
Shape of x_test (280, 5)
Shape of y_train (120, 1)
Shape of y_test (280, 1)


After this activity, you should have train and test sets that are ready for training and testing the kNN classifier.
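
With the train set in hand, a value of $k$ can be chosen by comparing cross-validated accuracy over several candidates. The sketch below is one possible approach, not the method used to arrive at the value in this notebook; the data come from `make_classification` (an assumption standing in for the ads dataset), and the candidate range of odd $k$ values is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the ads dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# Mean 5-fold cross-validated accuracy on the TRAIN set for each odd k.
cv_scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print("Best k by cross-validation:", best_k)
```

Using only the train set for this search keeps the test set untouched for the final accuracy check.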

#### Activity 4: Build kNN Classifier Model

Deploy the kNN Classifier model for the optimal value of $k$ using the steps given below:   

1. Import the `KNeighborsClassifier` class from the `sklearn.neighbors` module (if not imported yet).

2. Create an object of `KNeighborsClassifier` and pass $k = 5$ (taken here as the optimal value) to its constructor as `n_neighbors`.

3. Call the `fit()` function using the classifier object and pass the train set as inputs to this function.

4. Perform prediction for train and test sets using the `predict()` function.

5. Also, determine the accuracy score of the train and test sets using the `score()` function.
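
The five steps above can be sketched as follows. The data are generated with `make_classification` (an assumption standing in for the ads dataset), so the printed accuracies are not comparable to this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier  # Step 1: import the class

# Synthetic stand-in data (an assumption; not the ads dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# Step 2: create the classifier with k = 5.
clf = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit on the train set (y is a 1-D array).
clf.fit(X_tr, y_tr)

# Step 4: predict for the train and test sets.
train_pred = clf.predict(X_tr)
test_pred = clf.predict(X_te)

# Step 5: 'score()' returns the mean accuracy on the given data.
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```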

In [None]:
# Train the kNN classifier model.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning.
knn.fit(x_train, y_train.values.ravel())

# Perform prediction using the 'predict()' function.
test_pred = knn.predict(x_test)
train_pred = knn.predict(x_train)

# Check the accuracy of the test set and the train set
# ('knn.score(X, y)' reports the same mean accuracy).
print("Test accuracy:", accuracy_score(y_test, test_pred))
print("Train accuracy:", accuracy_score(y_train, train_pred))

Test accuracy: 0.7464285714285714
Train accuracy: 0.8083333333333333



Print the classification report to get an in-depth overview of the classifier performance using the `classification_report()` function of `sklearn.metrics` module.

In [None]:
# Display the precision, recall, and f1-score values.
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))

              precision    recall  f1-score   support

           0       0.74      0.94      0.82       178
           1       0.79      0.41      0.54       102

    accuracy                           0.75       280
   macro avg       0.76      0.67      0.68       280
weighted avg       0.76      0.75      0.72       280



**Q:** Write down the f1-scores for both the target labels.

**A:** Not purchased (`0`): 0.82; purchased (`1`): 0.54.
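
Each f1-score in the report is the harmonic mean of that label's precision $P$ and recall $R$, that is, $F_1 = 2PR / (P + R)$. The sketch below checks this on a tiny hand-made set of labels (`y_true_demo` and `y_pred_demo` are assumptions, not this notebook's predictions) by comparing the manual formula with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: four instances of each class.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 0, 0, 1, 1]

# Compute F1 by hand for each label from its precision and recall.
manual_f1 = {}
for label in (0, 1):
    p = precision_score(y_true_demo, y_pred_demo, pos_label=label)
    r = recall_score(y_true_demo, y_pred_demo, pos_label=label)
    manual_f1[label] = 2 * p * r / (p + r)

# 'average=None' returns one f1-score per label, in label order.
f1_per_class = f1_score(y_true_demo, y_pred_demo, average=None)

print(manual_f1)
print(f1_per_class)
```

For these labels, class `1` has precision 2/3 and recall 1/2, giving an f1-score of 4/7; class `0` has precision 3/5 and recall 3/4, giving 2/3.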




---

