## Important: Please read the instructions below.

- The Sheet is structured in **4 steps**:
    1. Understanding data and manipulation
    2. Data visualization
    3. Implementing machine learning models (note: use more than one algorithm)
    4. Model evaluation, concluding with the best model
- Good Luck! Happy Coding!

### Importing the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from plotnine.data import economics
#from plotnine import ggplot, aes, geom_line
from plotnine import *

In [None]:
dataset=pd.read_csv('churn.csv')

In [None]:
dataset

## Preprocessing and basic data cleaning

Data preprocessing is a data mining technique that transforms raw data into a format suitable for analysis.
This process has four main stages: data cleaning, data integration, data transformation, and data reduction.

Data cleaning detects, filters, and handles dirty data so that both the data and the analysis results can be trusted.
In this case, the dirty data may include noise (impossible or extreme values), outliers, and missing values.
Other errors may include inconsistent data and redundant attributes or records.
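
One practical caveat when counting missing values: `isnull()` only counts entries that pandas already treats as missing, so a blank string in a text column goes unnoticed. The sketch below illustrates this on a small made-up frame. The column name `charges` and its values are hypothetical and are not taken from `churn.csv`.

```python
import pandas as pd

# Hypothetical toy data: a numeric column stored as text, with one blank entry
toy = pd.DataFrame({"charges": ["29.85", " ", "108.15", "42.30"]})

# The blank string is an ordinary value, so it is not reported as missing
missing_before = int(toy["charges"].isnull().sum())

# Coercing to numeric turns entries that cannot be parsed into NaN
toy["charges"] = pd.to_numeric(toy["charges"], errors="coerce")
missing_after = int(toy["charges"].isnull().sum())

print(missing_before, missing_after)
```

Whether the real dataset contains such hidden blanks would need to be checked column by column.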



In [None]:
dataset.isnull().sum().sum()

In [None]:
dataset.duplicated().sum()

In [None]:
dataset.shape

In [None]:
#count the unique values contained in each column
dataset.nunique()
#the data contains no duplicate customers

In [None]:
#dropping the customerID
dataset.drop('customerID',axis=1,inplace=True)

In [None]:
dataset.dtypes

In [None]:
dataset.describe()

In [None]:
#restrict to numeric columns: the text columns are not encoded yet
dataset.corr(numeric_only=True)

In [None]:
#restrict to numeric columns: the text columns are not encoded yet
dataset.cov(numeric_only=True)

In [None]:
#heatmap of corr() over the numeric columns
sn.heatmap(dataset.corr(numeric_only=True))

In [None]:
dataset

In [None]:
# convert the string/object columns to categorical and replace each category with its integer code
for object_columns in dataset.columns:
    if(dataset[object_columns].dtype == 'object'):
        dataset[object_columns]= dataset[object_columns].astype('category')
        dataset[object_columns] = dataset[object_columns].cat.codes
dataset 

In [None]:
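#removing the outliers in the dataset
def Outlier1(data, threshold=3):
    """Drop rows whose value in any column lies more than `threshold`
    standard deviations from that column's mean."""
    for column in data.columns:
        # mean and population standard deviation of the rows that remain
        mean_data = data[column].mean()
        print(mean_data)
        std_data = data[column].std(ddof=0)
        print(std_data)
        # z-score of every value in the column, computed in one vectorized step
        z_score = (data[column] - mean_data) / std_data
        # keep only the rows that are not outliers in this column
        data = data[~(z_score.abs() > threshold)]
    return data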

In [None]:
dataset=Outlier1(dataset)
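
To make the z-score rule concrete, here is a minimal sketch on a single made-up column. The values are hypothetical; the threshold of 3 matches the one used above.

```python
import numpy as np

# Hypothetical sample: 19 ordinary values and one extreme value
values = np.array([50.0] * 19 + [500.0])

threshold = 3
# z-score: distance from the mean in units of the population standard deviation
z_scores = (values - values.mean()) / values.std()

is_outlier = np.abs(z_scores) > threshold
kept = values[~is_outlier]

print(values[is_outlier])  # the values flagged as outliers
print(kept.size)           # how many values remain
```

Flagged rows are dropped rather than capped, so the dataset shrinks each time a column contains extreme values.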

### DATA VISUALIZATION

In [None]:
dataset

In [None]:
#histogram plot
dataset.hist(bins=50 ,figsize=(20,15))


In [None]:
#How much MonthlyCharges affects Churn
sn.set(style="whitegrid")
sn.barplot(x='Churn',y='MonthlyCharges', data=dataset, palette='Spectral')


In [None]:
g = sn.relplot(x='Churn', y="MonthlyCharges",kind="line", data=dataset)
g.fig.autofmt_xdate()


In [None]:
#how Dependents is distributed across the target (Churn)
sn.countplot(data=dataset,x='Churn',hue='Dependents')


In [None]:
contract = dataset['Contract'].value_counts()
sn.barplot(x=contract.index, y=contract.values, alpha=0.9)


In [None]:
sn.histplot(x='tenure',data=dataset,hue='Churn')


In [None]:
sn.histplot(x='SeniorCitizen',data=dataset,hue='Churn')


In [None]:
sn.histplot(x='Partner',data=dataset,hue='Churn')


In [None]:
sn.countplot( x="Churn", hue="InternetService", data=dataset)


In [None]:
sn.countplot( x="Churn", hue="PhoneService", data=dataset)


From the two graphs above: in the no-churn category, DSL is the most used service, only slightly ahead of fiber optic.
In the churn category, churn is concentrated among fiber optic customers, which suggests that the company should
pay closer attention to this product, because it is a major factor in churn.

The majority of customers (churned or not) have phone service; only a small minority do not.
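
Count plots show absolute numbers, so groups of different sizes can be hard to compare. One way to check such a reading is to compute the share of churners within each group. The sketch below does this with `pd.crosstab` on a small made-up frame; the rows are hypothetical and the resulting shares are not results from the real dataset.

```python
import pandas as pd

# Hypothetical toy data (not taken from churn.csv)
toy = pd.DataFrame({
    "InternetService": ["Fiber optic"] * 4 + ["DSL"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# normalize="index" divides each row by its total, giving shares per service
churn_share = pd.crosstab(toy["InternetService"], toy["Churn"], normalize="index")
print(churn_share)
```

Each row of `churn_share` sums to 1, so the `Yes` column can be read directly as a churn rate per service.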

In [None]:
sn.barplot( x="tenure", y="Contract", hue="gender", data=dataset,orient="h")


In [None]:
sn.barplot( x="tenure", y="Contract", hue="PaymentMethod", data=dataset,orient="h")


In [None]:
sn.barplot( x="tenure", y="StreamingMovies", hue="Partner", data=dataset,orient="h")


In [None]:
sn.barplot( x="MonthlyCharges", y="InternetService", hue="StreamingTV", data=dataset,orient="h")


In [None]:
sn.barplot( x="tenure", y="OnlineSecurity", hue="InternetService", data=dataset)


In [None]:
sn.barplot( x="tenure", y="Contract", hue="PaperlessBilling", data=dataset,orient="h")


In [None]:
x=dataset.drop(['Churn'],axis=1)
y=dataset['Churn']


#### Normalization 

Normalization rescales the data from its original range so that all values fall within the new range of 0 to 1.



In [None]:
#rescale every attribute of the dataset to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
s = MinMaxScaler(feature_range=(0,1))
x = s.fit_transform(x)
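
For the default range of 0 to 1, min-max scaling applies x' = (x - min) / (max - min) to each column. A minimal sketch on one made-up column, done by hand with NumPy (the values are hypothetical):

```python
import numpy as np

# Hypothetical feature values, e.g. monthly charges
values = np.array([20.0, 50.0, 80.0, 120.0])

# x' = (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)
```

Because the minimum and maximum define the new range, a single extreme value compresses all other values toward 0.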

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 
0 and the standard deviation is 1.



In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
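#standardize the features only (the target is left out) and keep the fitted scaler object intact
x_standardized=scaler.fit_transform(dataset.drop(['Churn'],axis=1))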
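
Standardization applies z = (x - mean) / std to each column, using the population standard deviation. A minimal sketch on one made-up column, done by hand with NumPy (the values are hypothetical):

```python
import numpy as np

# Hypothetical feature values with mean 5 and population standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean) / std, where np.std uses the population formula by default
standardized = (values - values.mean()) / values.std()
print(standardized)
```

Unlike min-max scaling, the result is not bounded: values far from the mean simply receive large positive or negative scores.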

In [None]:
#of the two methods above, normalization gives higher accuracy than standardization
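
A comparison like this can be reproduced by running each scaler through the same cross-validation. The sketch below is one possible setup, not the procedure used in this notebook: it runs on synthetic data from `make_classification`, and names such as `X_demo` and `demo_kfold` are hypothetical. Placing the scaler inside a pipeline means it is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data; swap in the real features and target to compare
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

demo_kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = {}
for label, demo_scaler in [("min-max", MinMaxScaler()), ("standard", StandardScaler())]:
    # the pipeline refits the scaler on the training part of every fold
    pipeline = make_pipeline(demo_scaler, LogisticRegression(max_iter=1000))
    cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=demo_kfold, scoring="accuracy")
    scores[label] = cv_scores.mean()

for label, mean_accuracy in scores.items():
    print(f"{label}: {mean_accuracy:.3f}")
```

The numbers printed for the synthetic data say nothing about the churn data; only the comparison procedure carries over.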

### Implement Machine Learning Models

In [None]:
#applying several classification methods and finding the accuracy of each

In [None]:
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


In [None]:
#split out a validation set
validation_size = 0.20
seed = 7
x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=validation_size, random_state=seed)

In [None]:
# Spot-Check Algorithms
models = []
# multi_class is omitted: the target is binary and recent scikit-learn versions deprecate the argument
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

In [None]:
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
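
To see what `KFold` does with the rows, the sketch below splits ten hypothetical row indices into five folds and counts how often each row is used for validation. The names `samples` and `demo_kfold` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # ten hypothetical row indices
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=7)

validation_counts = np.zeros(len(samples), dtype=int)
for train_idx, validation_idx in demo_kfold.split(samples):
    # each fold holds out a different 2 of the 10 rows for validation
    validation_counts[validation_idx] += 1
    print("train:", train_idx, "validation:", validation_idx)

print(validation_counts)
```

Every row is held out exactly once, so the mean of the fold scores uses each training row for validation a single time.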


In [None]:
results

In [None]:
cv_results.mean()

In [None]:
# Make predictions on the validation dataset with the SVM classifier
svm = SVC(gamma='auto')
svm.fit(x_train, y_train)
predictions = svm.predict(x_validation)
print(accuracy_score(y_validation, predictions))
print(confusion_matrix(y_validation, predictions))
print(classification_report(y_validation, predictions))
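
The accuracy, precision and recall in these reports all derive from the four cells of the confusion matrix. The sketch below works through the arithmetic with made-up counts, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and assuming that class 1 stands for churn.

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes,
# with class 0 (assumed: no churn) first
matrix = np.array([[90, 10],
                   [20, 30]])
tn, fp, fn, tp = matrix.ravel()

accuracy = (tp + tn) / matrix.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted churners who did churn
recall = tp / (tp + fn)               # share of actual churners who were found

print(accuracy, precision, recall)
```

With imbalanced classes, accuracy alone can look good while recall for the churn class stays low, which is why the report is worth reading alongside the accuracy score.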


In [None]:
#prediction using a random forest
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
num_trees = 100
max_features = 3
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, x, y, cv=kfold)
print(results.mean())

#### Conclusion

From the data visualization above, it is observed that:

1. Contract, gender and tenure: no significant information can be drawn from these features; males and females behave the same way.

2. Payment methods: the preferred means of payment are electronic check, bank transfer and credit card; mailed check is the
least used across all contract types.

3. Streaming movies: most of the customers who use this service have a partner.

4. Fiber optic is expensive, hence customers are leaving this product.

5. Internet service customers with long tenure tend to subscribe to online security.

6. Long tenure is associated with paperless billing, so the company should prioritize this billing option.

Also, from the predictions, logistic regression, SVM and LDA each reached about 80% accuracy.

Himani chhokar