# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
## Not for grading

## Learning Objective

The objective of this experiment is to understand how a linear classifier works.

## Dataset

The dataset chosen for this experiment is a handmade fruits dataset. The dataset contains 69 records. Each record contains the following details about a fruit:

* Weight - The mass of the fruit. In this dataset, weights are measured in grams.

* Sphericity - A measure of how closely the shape of an object approaches that of a mathematically perfect sphere.

* Color - A fruit has a different color at different stages. The color is encoded as an integer value. For example:

     - Green as 20
     - Greenish Yellow as 40
     - Orange as 60
     - Red as 80
     - Reddish Yellow as 100

* Label - The fruit type. We have considered two fruits for simplicity: Apple and Orange.




In [None]:
 !wget https://cdn.talentsprint.com/aiml/Experiment_related_data/fruits_weight_sphercity.csv

### Importing Required Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import SGDClassifier

### Loading the data

In [None]:
fruits_data = pd.read_csv("fruits_weight_sphercity.csv")
fruits_data.head()

In [None]:
# Encode each color name as an integer, and the labels as 1 (apple) and 0 (orange)
fruits_data['Color'] = fruits_data['Color'].replace(['Green', 'Greenish yellow','Orange', 'Red','Reddish yellow'],[20, 40, 60, 80, 100])
fruits_data['labels'] = fruits_data['labels'].replace(['apple','orange'],[1, 0])

**To get a better understanding of the data, let us look at the first five rows using head() and the last five rows using tail()**

In [None]:
fruits_data.head()

In [None]:
fruits_data.tail()

There are a few noisy samples in the data that skew the accuracies. The code below drops them. However, before un-commenting it, go through the experiment and visualize those noisy samples, then re-run the experiment with the lines below un-commented.

In [None]:
#fruits_data = fruits_data.drop(fruits_data[(fruits_data['labels'] == 1) & (fruits_data['Weight'] > 325)].index)
#fruits_data = fruits_data.drop(fruits_data[(fruits_data['labels'] == 0) & (fruits_data['Weight'] < 290)].index)
#fruits_data.head()
# To understand the above code properly, look at the plot and try dropping the noisy samples of class 0 and class 1.
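Before dropping anything, it can help to look at which rows the two conditions actually select. The following is a minimal sketch on a small hypothetical stand-in table (`toy` and its values are made up for illustration); in the notebook you would apply the same masks to `fruits_data`.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use fruits_data instead.
toy = pd.DataFrame({
    "Weight": [250, 340, 300, 280, 310, 360],
    "labels": [1, 1, 1, 0, 0, 0],  # 1 = apple, 0 = orange
})

# Same conditions as the commented-out drop() calls above.
heavy_apples = (toy["labels"] == 1) & (toy["Weight"] > 325)
light_oranges = (toy["labels"] == 0) & (toy["Weight"] < 290)

# Rows flagged as noisy, and the table that remains after removing them.
noisy = toy[heavy_apples | light_oranges]
cleaned = toy.drop(noisy.index)

print(noisy)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Inspecting `noisy` first makes it easier to judge whether the thresholds (325 and 290 grams) are reasonable for the real data.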

### Storing data and labels in two separate variables


In [None]:
data = fruits_data[["Weight","Color","Sphericity"]] 

In [None]:
labels = fruits_data["labels"]

In [None]:
data.shape, type(data)

### Visualizing the data

Let us plot two of the three parameters for visualization. (If you're interested in plotting in 3-D, which might be of help here, you can explore Matplotlib's Axes3D [here](https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html))

In [None]:
apples = fruits_data[fruits_data['labels']== 1] # apples are 1
oranges = fruits_data[fruits_data['labels']== 0] # oranges are 0 

In [None]:
plt.plot(apples.Weight, apples.Sphericity, "ro")
plt.plot(oranges.Weight, oranges.Sphericity, "bo")

plt.xlabel("Weight -- in grams")
plt.ylabel("Sphericity")

plt.legend(["Apples", "Oranges"])

#plt.plot([373], [1], "ko")
plt.show()
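For the 3-D view mentioned earlier, here is a minimal sketch that plots all three features at once. It uses randomly generated stand-in data (`toy` is hypothetical, and its value ranges are assumptions, not taken from the fruits dataset); in the notebook you would pass the columns of `fruits_data` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data; in the notebook, use fruits_data instead.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Weight": rng.uniform(200, 400, size=20),
    "Color": rng.choice([20, 40, 60, 80, 100], size=20),
    "Sphericity": rng.uniform(0.5, 1.0, size=20),
    "labels": np.repeat([1, 0], 10),  # 1 = apple, 0 = orange
})

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # creates a 3-D axes

# One scatter call per class so each gets its own color and legend entry.
for label, color, name in [(1, "red", "Apples"), (0, "blue", "Oranges")]:
    subset = toy[toy["labels"] == label]
    ax.scatter(subset["Weight"], subset["Color"], subset["Sphericity"],
               c=color, label=name)

ax.set_xlabel("Weight -- in grams")
ax.set_ylabel("Color (encoded)")
ax.set_zlabel("Sphericity")
ax.legend()
```

In a notebook the figure is displayed when the cell finishes; in a script you would call `plt.show()` afterwards.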

### Splitting the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

In [None]:
# Let us see the size of train and test sets
X_train.shape, X_test.shape

### Training a Linear Classifier

In [None]:
linear_classifier = SGDClassifier(random_state=42)

In [None]:
# Training or fitting the model with the train data
linear_classifier.fit(X_train, y_train)

# Testing the trained model
y_pred = linear_classifier.predict(X_test)
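To see what makes this classifier "linear", the sketch below recomputes its predictions by hand. After fitting, the model holds a weight vector (`coef_`) and a bias (`intercept_`); each sample is scored as w · x + b and assigned to class 1 when the score is positive. The data here is a hypothetical 2-D toy set, not the fruits dataset.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical, well-separated 2-D toy data (not the fruits dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),
               rng.normal(2, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SGDClassifier(random_state=42).fit(X, y)

# A linear classifier scores each sample as w . x + b ...
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
# ... and predicts class 1 when the score is positive, class 0 otherwise.
manual_pred = (scores > 0).astype(int)

# Both comparisons check the hand-computed values against scikit-learn's own.
print(np.allclose(scores, clf.decision_function(X)))
print(np.array_equal(manual_pred, clf.predict(X)))
```

The same check can be run on `linear_classifier` with `X_test` to confirm that its predictions come from a single weighted sum of Weight, Color and Sphericity.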

In [None]:
# Calculating the score
linear_classifier.score(X_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Not happy with the accuracies? Try finding out exactly which samples caused the accuracy to drop. (This is doable here because the dataset is small; this sort of analysis is infeasible on large datasets.)
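One way to find those samples is to compare the predictions with the true labels and keep only the rows where they disagree. The following is a minimal sketch on hypothetical stand-in data (`toy`, `toy_labels` and the labelling rule are made up for illustration); in the notebook you would reuse `X_test`, `y_test` and `y_pred` directly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the notebook, reuse X_test, y_test and y_pred.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Weight": rng.uniform(200, 400, size=60),
    "Color": rng.choice([20, 40, 60, 80, 100], size=60),
    "Sphericity": rng.uniform(0.5, 1.0, size=60),
})
# Made-up labelling rule, used only so the sketch has something to learn.
toy_labels = pd.Series((toy["Weight"] < 300).astype(int), name="labels")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_labels, test_size=0.2, random_state=42)
pred = SGDClassifier(random_state=42).fit(X_tr, y_tr).predict(X_te)

# Keep only the test rows whose prediction disagrees with the true label.
wrong = pred != y_te.to_numpy()
misclassified = X_te[wrong].assign(true_label=y_te[wrong], predicted=pred[wrong])
print(misclassified)
```

Looking at the feature values of the misclassified rows next to the scatter plot is a simple way to spot the noisy samples.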