## Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


# OSEMN
## O - obtain data

The data set used here is [**Telco Customer Churn**](https://www.kaggle.com/blastchar/telco-customer-churn) from Kaggle (Focused customer retention programs).

**Information from Kaggle about dataset:**
- Each row represents a customer; each column contains a customer attribute described in the column metadata.
- The raw data contains 7043 rows (customers) and 21 columns (features).
- The “Churn” column is our target (customers who left).
- Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

## S - scrub data

To prepare the data for the study, I use the following steps:<br>
- Rename columns to lowercase, per team agreement
- Check duplicates
- Check data set for missing values (Null/NaN) or any wrong values
- Drop column "customerID"

In [1]:
# import important libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# show all rows and columns when displaying DataFrames
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# use custom function
%run -i 'py/dataframecheck.py'

In [2]:
# read dataset
df = pd.read_csv("Data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head(3)

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerid          7043 non-null object
gender              7043 non-null object
seniorcitizen       7043 non-null int64
partner             7043 non-null object
dependents          7043 non-null object
tenure              7043 non-null int64
phoneservice        7043 non-null object
multiplelines       7043 non-null object
internetservice     7043 non-null object
onlinesecurity      7043 non-null object
onlinebackup        7043 non-null object
deviceprotection    7043 non-null object
techsupport         7043 non-null object
streamingtv         7043 non-null object
streamingmovies     7043 non-null object
contract            7043 non-null object
paperlessbilling    7043 non-null object
paymentmethod       7043 non-null object
monthlycharges      7043 non-null float64
totalcharges        7043 non-null object
churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)

In [3]:
# convert all column names to lower case
df.columns = df.columns.str.lower()

In [4]:
# Check duplicates
df.duplicated().any()

False

In [5]:
# Check data set for missing values (Null/NaN)
df.isnull().sum(axis=0)

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [7]:
# on closer inspection, totalcharges contains blank strings (" "); mark them as NaN
df["totalcharges"] = df["totalcharges"].replace(" ", np.nan)
# drop the rows with missing totalcharges (11 of 7043 rows, about 0.16%)
df.dropna(inplace=True)
# convert the column to float
df["totalcharges"] = df["totalcharges"].astype(float)
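
The blank strings in `totalcharges` do not show up in `isnull()` because a single space is still a valid string. One way to surface such values, sketched here on a small made-up frame (the `toy` data is illustrative and not taken from the data set), is to coerce the column with `pd.to_numeric(..., errors="coerce")` and count the entries that fail to parse.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "totalcharges": ["29.85", " ", "1889.5"],
})

# isnull() reports nothing because " " is a regular string, not NaN.
n_null = int(toy["totalcharges"].isnull().sum())

# Coercing to numeric turns every unparseable entry into NaN.
as_number = pd.to_numeric(toy["totalcharges"], errors="coerce")
n_bad = int(as_number.isna().sum())

# Rows whose totalcharges could not be parsed as a number.
bad_rows = toy[as_number.isna()]
print(n_null, n_bad)
print(bad_rows)
```

Comparing the two counts makes the hidden missing values visible before deciding whether to drop or impute them.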

In [10]:
# customerid is not useful for modeling, so drop it
df.drop(['customerid'], axis=1, inplace=True)

In [12]:
# replace 'No internet service' with 'No' in the following columns
internet_cols = ['onlinesecurity', 'onlinebackup', 'deviceprotection',
                 'techsupport', 'streamingtv', 'streamingmovies']
for col in internet_cols:
    df[col] = df[col].replace({'No internet service': 'No'})
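
As a possible alternative to the loop, the same replacement can be applied to all listed columns in one call. The sketch below uses a small made-up frame (`toy` and `service_cols` are hypothetical names, and the values are illustrative) and then checks which labels remain.

```python
import pandas as pd

# Toy frame; column names follow the notebook, values are illustrative.
toy = pd.DataFrame({
    "onlinesecurity": ["Yes", "No internet service", "No"],
    "streamingtv": ["No internet service", "Yes", "No"],
    "contract": ["One year", "Two year", "Month-to-month"],
})
service_cols = ["onlinesecurity", "streamingtv"]  # subset used for the example

# Replace the label in all listed columns with a single call.
toy[service_cols] = toy[service_cols].replace("No internet service", "No")

# Collect the distinct labels left in those columns.
remaining = sorted(pd.unique(toy[service_cols].values.ravel()))
print(remaining)
```

Columns outside `service_cols`, such as `contract` here, are left untouched.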

In [14]:
df.shape

(7032, 20)

### We are ready for EDA with a data set of 7032 rows and 20 columns.

-------
## E - explore data
The goal of this section is to get comfortable with our data. <br>
We check how each variable relates to the churn rate. For categorical features, we can use frequency tables or bar plots, which show the number of customers in each category of a variable. For numerical features, probability density plots show how the variable is distributed.
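
The frequency-table idea can be sketched with `pd.crosstab`. The example below is a minimal illustration on made-up data (the `toy` frame and its values are assumptions, not rows from the data set): it counts customers per category and churn label, then normalizes each row to get the churn share within a category.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "One year"],
    "churn": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Frequency table: customer counts per contract type and churn label.
counts = pd.crosstab(toy["contract"], toy["churn"])

# Row-normalized table: share of each churn label within a contract type.
rates = pd.crosstab(toy["contract"], toy["churn"], normalize="index")

print(counts)
print(rates["Yes"])
```

Calling `counts.plot(kind="bar")` on the result would give the corresponding bar plot.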

In [15]:
# define categorical columns and numeric columns
cat_cols=['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity',
       'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv',
       'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod']
numeric_cols=['tenure','monthlycharges', 'totalcharges']
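
For the numeric columns, a density comparison between churned and retained customers can be approximated without plotting by using a normalized histogram. This is only a rough sketch on made-up data: the `toy` values are illustrative, and the 0 to 72 bin range is an assumption about tenure measured in months.

```python
import numpy as np
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 5, 40, 50, 60, 70],
    "churn": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Shared bin edges so both groups are directly comparable (assumed range).
edges = np.linspace(0, 72, 4)  # three equal-width bins

densities = {}
for label, group in toy.groupby("churn"):
    # density=True rescales the counts so the histogram integrates to 1.
    dens, _ = np.histogram(group["tenure"], bins=edges, density=True)
    densities[label] = dens

# Multiplying by the bin width gives the share of customers in each bin.
for label, dens in densities.items():
    print(label, np.round(dens * np.diff(edges), 2))
```

A kernel density plot conveys the same comparison as a smooth curve; the histogram version is used here only because it needs no plotting backend.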