# Classification Analysis

## Part I: Research Question

> What is the relationship between age, bandwidth usage, monthly charge, tenure, and churn?

> We will answer this question using k-nearest neighbors (KNN) classification.

> The goal of this data analysis is to identify which combinations of age, bandwidth usage, monthly charge, and tenure are associated with the highest likelihood of churn.

## Part II: Method Justification

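> KNN classifies an observation by finding the k training observations closest to it in the predictor space and assigning the most common class among those neighbors. The target variable for this analysis is Churn, and the predictor variables are age, bandwidth usage, monthly charge, and tenure. The model is fit on a training set and checked against a held-out test set, which lets us estimate which groups of customers are most likely and least likely to churn.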

> The KNN algorithm assumes that similar observations lie close to one another in the predictor space. In other words, the customers nearest to a given customer are expected to share its churn outcome, so the labels of the closest neighbors can be used to predict the label of a new customer.
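To make the proximity assumption concrete, here is a minimal sketch of KNN written by hand with NumPy. The function name `knn_predict` and the toy data are made up for illustration; this is not the scikit-learn implementation used later in the notebook, only the same idea in a few lines.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training rows."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two features, two well-separated groups
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_train = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])

pred_low = knn_predict(X_train, y_train, np.array([0.15, 0.15]), k=3)
pred_high = knn_predict(X_train, y_train, np.array([2.0, 2.1]), k=3)
print(pred_low, pred_high)
```

Each new point takes the label of the group it sits next to, which is exactly the behavior the assumption describes.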

In [18]:
# Importing packages to be used
import numpy as np
# numpy for data analysis functionality
import pandas as pd
# pandas for dataframes
from sklearn.neighbors import KNeighborsClassifier
# KNN classifier
import matplotlib.pyplot as plt
%matplotlib inline
# matplotlib for plotting
import seaborn as sns
# seaborn for extra plotting functionality

## Part III: Data Preparation

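> 1. One data preprocessing goal for this analysis will be to normalize the data. KNN is based on distances between observations, so without scaling, a variable measured on a large scale (such as yearly bandwidth in GB) would dominate variables measured on a small scale (such as age).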

> 2. The initial variables we will use are Churn, which is categorical, and Age, MonthlyCharge, Bandwidth_GB_Year, and Tenure, which are all continuous.

> 3. Preparing the data for analysis:

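Data preparation will consist of checking for bad data, such as null values and duplicated rows, keeping only the columns needed for the analysis, and scaling the numeric predictors so they can be used with KNN. The target, Churn, is kept as Yes/No labels, which the scikit-learn KNN classifier accepts directly.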
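The normalization goal in item 1 can be motivated with a small sketch. The three rows below are made-up values, not rows from the churn dataset; they only show how an unscaled large-range column can swamp the distance calculation and how `StandardScaler` puts both columns on a comparable footing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows for illustration: columns are [Age, Bandwidth_GB_Year]
X = np.array([[30.0, 1000.0],
              [70.0, 1010.0],
              [31.0, 3000.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw distances: row 1 differs from row 0 mostly in age (40 years),
# row 2 differs from row 0 mostly in bandwidth (2000 GB)
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# Standardize each column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

Before scaling, the bandwidth gap makes row 2 look far more distant than row 1 even though a 40-year age gap is also large. After scaling, the two distances are of similar size, so neither variable dominates the neighbor search.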

In [3]:
# Importing the data
# A forward slash avoids the invalid '\c' escape sequence in the original path string
df = pd.read_csv('Churn Data/churn_clean.csv')

In [4]:
# Checking for null values
df.isna().any()

CaseOrder               False
Customer_id             False
Interaction             False
UID                     False
City                    False
State                   False
County                  False
Zip                     False
Lat                     False
Lng                     False
Population              False
Area                    False
TimeZone                False
Job                     False
Children                False
Age                     False
Income                  False
Marital                 False
Gender                  False
Churn                   False
Outage_sec_perweek      False
Email                   False
Contacts                False
Yearly_equip_failure    False
Techie                  False
Contract                False
Port_modem              False
Tablet                  False
InternetService         False
Phone                   False
Multiple                False
OnlineSecurity          False
OnlineBackup            False
DeviceProt

> Every column shown returns False, meaning no null values were found.

In [5]:
# Checking for duplicates
df.duplicated().any()

False

> The result is False, meaning there are no duplicated rows.

In [32]:
# Creating new dataframe with only the values we need
newdf = df[['Age', 'MonthlyCharge', 'Bandwidth_GB_Year','Tenure','Churn']].copy()

In [33]:
newdf

| | Age | MonthlyCharge | Bandwidth_GB_Year | Tenure | Churn |
|---|---|---|---|---|---|
| 0 | 68 | 172.455519 | 904.536110 | 6.795513 | No |
| 1 | 27 | 242.632554 | 800.982766 | 1.156681 | Yes |
| 2 | 50 | 159.947583 | 2054.706961 | 15.754144 | No |
| 3 | 48 | 119.956840 | 2164.579412 | 17.087227 | No |
| 4 | 83 | 149.948316 | 271.493436 | 1.670972 | Yes |
| ... | ... | ... | ... | ... | ... |
| 9995 | 23 | 159.979400 | 6511.252601 | 68.197130 | No |
| 9996 | 48 | 207.481100 | 5695.951810 | 61.040370 | No |
| 9997 | 48 | 169.974100 | 4159.305799 | 47.416890 | No |
| 9998 | 39 | 252.624000 | 6468.456752 | 71.095600 | No |


> 4. Copy of cleaned data:

In [34]:
# export file to csv
newdf.to_csv('cleaned_data.csv', index = False)

## Part IV: Analysis

In [35]:
# Scaling the data
from sklearn.preprocessing import StandardScaler

# Select only the numeric columns; Churn holds Yes/No strings and is left unchanged
num_cols = newdf.select_dtypes(include=np.number).columns

# Standardize each numeric column to mean 0 and standard deviation 1
scaler = StandardScaler()

newdf[num_cols] = scaler.fit_transform(newdf[num_cols])

In [36]:
# Showing the scaled data
newdf

| | Age | MonthlyCharge | Bandwidth_GB_Year | Tenure | Churn |
|---|---|---|---|---|---|
| 0 | 0.720925 | -0.003943 | -1.138487 | -1.048746 | No |
| 1 | -1.259957 | 1.630326 | -1.185876 | -1.262001 | Yes |
| 2 | -0.148730 | -0.295225 | -0.612138 | -0.709940 | No |
| 3 | -0.245359 | -1.226521 | -0.561857 | -0.659524 | No |
| 4 | 1.445638 | -0.528086 | -1.428184 | -1.242551 | Yes |
| ... | ... | ... | ... | ... | ... |
| 9995 | -1.453214 | -0.294484 | 1.427298 | 1.273401 | No |
| 9996 | -0.245359 | 0.811726 | 1.054194 | 1.002740 | No |
| 9997 | -0.245359 | -0.061729 | 0.350984 | 0.487513 | No |
| 9998 | -0.680187 | 1.863005 | 1.407713 | 1.383018 | No |


In [37]:
# Splitting the data into training (75%) and test (25%) sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(newdf, test_size=0.25, random_state=42)

x_train = train.drop('Churn', axis=1)
y_train = train['Churn']

x_test = test.drop('Churn', axis = 1)
y_test = test['Churn']

In [38]:
knn = KNeighborsClassifier(n_neighbors=10, weights='uniform')
knn.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=10)
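Once a classifier is fit, a natural next step is to score it on the held-out test set. The sketch below is one possible way to do that, not the notebook's own evaluation. It uses synthetic data from `make_classification` as a stand-in for the churn predictors (an assumption, since the original CSV is not available here). It also wraps the scaler and classifier in a pipeline so the scaler is fit on the training rows only, whereas the cells above scale the full dataset before splitting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 numeric predictors, binary target
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

# Same 75/25 split proportions as the notebook; stratify keeps class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# The pipeline fits the scaler on the training rows only, then fits KNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
model.fit(X_train, y_train)

# Predict on the held-out rows and summarize the results
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Test accuracy: {accuracy:.3f}")
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)
```

With the notebook's own data, the equivalent calls would be `knn.predict(x_test)` followed by `accuracy_score(y_test, ...)` and `confusion_matrix(y_test, ...)`. The figures printed here apply only to the synthetic data and say nothing about the churn model itself.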