## 1.Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
diabetes_df = pd.read_csv('../week-14-repository/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
#Checking the class ratio 
#calculating the number of data that belong to each class in Outcome variable
diabetes_df['Outcome'].value_counts()


#The data is pretty imbalanced, where the majority class belongs to the “0” ( negative) 
# and the minority class belongs to the “1” (positive).

0    500
1    268
Name: Outcome, dtype: int64

In [5]:
#Importing Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#Deciding independent(X) and depencdent(y) variables:
X = diabetes_df.drop('Outcome',axis=1)
y = diabetes_df['Outcome']

#Splitting train and test 
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.5,random_state = 10)

#Standardize 
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)



## Without using Smote-enn

In [6]:
#Checking the model performance without usiing Smote-enn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaler,y_train)
y_pred = model.predict(X_test_scaler)

from sklearn.metrics import accuracy_score
accuracy_score=accuracy_score(y_test,y_pred)
print("Model performance without Smote-enn is :",accuracy_score)

Model performance without Smote-enn is : 0.7838541666666666


In [7]:
#Generating classification report
from sklearn.metrics import classification_report


print(classification_report(y_pred,y_test))

              precision    recall  f1-score   support

           0       0.89      0.80      0.84       280
           1       0.58      0.74      0.65       104

    accuracy                           0.78       384
   macro avg       0.74      0.77      0.75       384
weighted avg       0.81      0.78      0.79       384



## With SMOTE-ENN

In [123]:
#the accuracy score is pretty high, but the recall score is slightly lower (around 0.74). 
#This means that the model performance to correctly predict the minority class label is not good enough.

In [8]:
#importing libraries
#this method combines the SMOTE ability to generate synthetic examples for minority class and ENN ability to delete 
#some observations from both classes that are identified as having different class between the observation’s class and 
#its K-nearest neighbor majority class.

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN (random_state=42)
X_resampled,y_resampled =smote_enn.fit_resample(X_train_scaler,y_train)

In [9]:
#train using resampled data

model = LogisticRegression(random_state=42)
model.fit(X_resampled,y_resampled)
y_pred = model.predict(X_test_scaler)

from sklearn.metrics import balanced_accuracy_score

balanced_acc_score=balanced_accuracy_score(y_test,y_pred)

print("Model performance with Smote-enn is :",balanced_acc_score)

Model performance with Smote-enn is : 0.735793667435521


In [126]:
#Classification report
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))


                   pre       rec       spe        f1       geo       iba       sup

          0       0.88      0.63      0.84      0.73      0.73      0.52       251
          1       0.55      0.84      0.63      0.66      0.73      0.54       133

avg / total       0.77      0.70      0.77      0.71      0.73      0.53       384



The recall score after using smote-enn has increased, although the accuracy and precision score are slightly decreased. 
This means that the model performance to correctly predict the minority class label is getting better by
using SMOTE-ENN to balance the data.

## 2.Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

## With SMOTE

In [133]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome',axis = 1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)


from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train_scaler, y_train)

#train using the resampled data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

#calculate the accuracy score
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)



print()

0.7268518518518519

In [12]:
#Classification report
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.77      0.70      0.80      0.74      0.54       251
          1       0.62      0.70      0.77      0.66      0.74      0.54       133

avg / total       0.76      0.75      0.72      0.75      0.74      0.54       384



In my opinion,The performance of the combined sampling is slightly higher than the SMOTE and other techniques that swe have used so far for our diabetes dataset.
SMOTE -ENN technique is the best in predicting the minority label in a better way as we have a good recall value.

## 3.What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Overview:

An outlier can be considered as an odd man out in a series of data. Outliers can be unusually and extremely different from most of the data points existing in our sample. It could be a very large observation or a very small observation. Outliers can create biased results while calculating the stats of the data due to its extreme nature, thereby affecting further statistical/ML models.
Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data.

Methods for Outlier Detection :
1) The simplest way to detect an outlier is by graphing the features or the data points. Visualization is one of the best and easiest ways to have an inference about the overall data and the outliers. Scatter plots and box plots are the most preferred visualization tools to detect outliers.

2) Histograms can also be used to identify outlier. However in a histogram, existence of outliers can be detected by isolated bars.

3) InterQuartile range (IQR) technique: This method can be used to find the maximum and minimum values of data points that are outliers by calculating the boundaries.


4) Z score is an important concept in statistics. Z score is also called standard score. This score helps to understand if a data value is greater or smaller than mean and how far away it is from the mean. More specifically, Z score tells how many standard deviations away a data point is from the mean.

    Z score = (x -mean) / std. deviation 
    
    
Z score and Outliers:
If the z score of a data point is more than 3, it indicates that the data point is quite different from the other data points. Such a data point can be an outlier.
data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
mean = np.mean(data)
std = np.std(data)
print('mean of the dataset is', mean)
print('std. deviation is', std)


threshold = 3
outlier = []
for i in data:
    z = (i-mean)/std
    if z > threshold:
        outlier.append(i)
print('outlier in dataset is', outlier)


5) There are various statistical tests that can be performed to detect outliers and one of them is the hypothesis testing. Below three statistical tests use the concept of hypothesis testing to identify outliers.
 o Grubbs’ test
 o Chi –square test.
 o Dixon’s Q test.
 
 
In Grubbs’ test and Dixon’s Q test, it is assumed that the data on which we are going to find outliers is normally distributed.
Whereas Chi-square test can be used for the same with the chi-square distribution
Dixon’s Q test are generally applied for datasets or samples containing very few observations and hence rarely used in data science.

## 4.	Perform a linear SVM to predict credit approval (last column) using this dataset: 

In [13]:
#Reading data into pandas and renaming the columns:
import pandas as pd 
credit_df = pd.read_csv('Australian.CSV',header=None,names=['A1','A2','A3','A4','A5','A6','A7','A8','A9','A10','A11','A12','A13','A14','Target'])
credit_df

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,Target
0,1,22.08,11.460,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.000,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.750,1,4,4,1.250,0,0,0,1,2,280,1,0
3,0,21.67,11.500,1,5,3,0.000,1,1,11,1,2,0,1,1
4,1,20.17,8.170,2,6,4,1.960,1,1,14,0,2,60,159,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,1,31.57,10.500,2,14,4,6.500,1,0,0,0,2,0,1,1
686,1,20.67,0.415,2,8,4,0.125,0,0,0,0,2,0,45,0
687,0,18.83,9.540,2,6,4,0.085,1,0,0,0,2,100,1,1
688,0,27.42,14.500,2,14,8,3.085,1,1,1,0,2,120,12,1


In [16]:
#Declaring category and continous DF's
credit_df_cat=credit_df[['A1','A4','A5','A6','A8','A9','A11','A12','Target']]
credit_df_cont=credit_df[['A2','A3','A7','A10','A13','A14','Target']]

In [17]:
credit_df_cont.describe()

Unnamed: 0,A2,A3,A7,A10,A13,A14,Target
count,690.0,690.0,690.0,690.0,690.0,690.0,690.0
mean,31.568203,4.758725,2.223406,2.4,184.014493,1018.385507,0.444928
std,11.853273,4.978163,3.346513,4.86294,172.159274,5210.102598,0.497318
min,13.75,0.0,0.0,0.0,0.0,1.0,0.0
25%,22.67,1.0,0.165,0.0,80.0,1.0,0.0
50%,28.625,2.75,1.0,0.0,160.0,6.0,0.0
75%,37.7075,7.2075,2.625,3.0,272.0,396.5,1.0
max,80.25,28.0,28.5,67.0,2000.0,100001.0,1.0


In [18]:
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      690 non-null    int64  
 1   A2      690 non-null    float64
 2   A3      690 non-null    float64
 3   A4      690 non-null    int64  
 4   A5      690 non-null    int64  
 5   A6      690 non-null    int64  
 6   A7      690 non-null    float64
 7   A8      690 non-null    int64  
 8   A9      690 non-null    int64  
 9   A10     690 non-null    int64  
 10  A11     690 non-null    int64  
 11  A12     690 non-null    int64  
 12  A13     690 non-null    int64  
 13  A14     690 non-null    int64  
 14  Target  690 non-null    int64  
dtypes: float64(3), int64(12)
memory usage: 81.0 KB


In [19]:
from sklearn.feature_selection import RFE 
from sklearn.svm import SVR
estimator = SVR(kernel="linear")
rfe = RFE(estimator,step=1)
rfe = rfe.fit(X_train, y_train)
rfe 

selected_rfe_features = pd.DataFrame({'Feature':list(X_train.columns),
                                      'Ranking':rfe.ranking_})
selected_rfe_features.sort_values(by='Ranking')

Unnamed: 0,Feature,Ranking
0,Pregnancies,1
1,Glucose,1
5,BMI,1
6,DiabetesPedigreeFunction,1
7,Age,2
2,BloodPressure,3
3,SkinThickness,4
4,Insulin,5


In [21]:
X = credit_df.drop('Target',axis=1)
y =credit_df['Target']

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)


In [22]:
len(X_train)

483

In [31]:
len(X_test)

207

In [26]:
#Checking the class ratio 
credit_df['Target'].value_counts()

0    383
1    307
Name: Target, dtype: int64

In [32]:
from sklearn.svm import SVC
model= SVC(kernel = 'linear',C=1,gamma=100)
model.fit(X_train_scaler,y_train)

from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test_scaler)
accuracy_score(y_test, y_pred)



0.8502415458937198

## 5.	How did the SVM model perform? Use a classification report. 

In [28]:
#Checking model performance

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.81      0.87       126
           1       0.76      0.91      0.83        81

    accuracy                           0.85       207
   macro avg       0.85      0.86      0.85       207
weighted avg       0.87      0.85      0.85       207



## 6.	What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

There are a number of data jobs available in the market based on different experience levels.Jobs that match the skills I
have learned would be of :
   Data Analyst 
   Data Scientist
   Data Engineer

Some of the reason for my interest in these jobs are :
    
    1) Companies are Facing Real Challenges in Organizing Data so they need people who can understand and use 
      their data in an interesting way.
    2) Great pay 
    3) Lot of scope for career advancement
    4) Challenging day to day work that focuses on problem solving skills.
    5) There are a plethora of other commonly-used job titles that involve data science work.Roles such as that 
    of a Data Scientist, Data Architect, BI Engineer, Business Analyst, Data Engineer, Database Administrator, 
    Data- and Analytics Manager are in high demand.
    6)  Omnipresence of Jobs
    
I think I would be suitable for these jobs as I have a skill set similar to the requirement of these jobs and would be 
comfortable working with :
    1) Intermediate data science programming in either Python or R, including the use of popular packages
    2) Intermediate SQL queries
    3) Data cleaning
    4) Data visualization
    5) Probability and statistics
    6) Communicating complex data analysis clearly and understandably to people with no statistics 
    or programming background
    
I would also be interested in doing any kind of Data Science Internships as it would give me an on-the-job learning experience.
Also, in many cases on-the-job learning often leads a path to a permanent, full-time job.