# Identify supervised learning examples

You have learned about the differences between supervised and unsupervised learning techniques. You will now review several use cases to sharpen that distinction. Which of these is NOT an example of supervised learning?

### Possible Answers


    Predict whether a company will default on a loan.
    
    
    Identify how many TV shows a customer will watch next month.
    
    
    Build customer segments using recency, frequency and monetary values. {Answer}
    
    
    Predict which cars will need a mechanic given their technical characteristics.
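
The segmentation answer is unsupervised because its input has no label column: recency, frequency and monetary values are all features. As a rough illustration, the sketch below derives those three values from a made-up transaction list in plain Python. The data, the snapshot date, and the helper name `rfm_table` are assumptions for this example, not part of the exercise.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("A", date(2024, 1, 5), 40.0),
    ("A", date(2024, 3, 1), 60.0),
    ("B", date(2024, 2, 10), 15.0),
    ("C", date(2024, 1, 20), 200.0),
    ("C", date(2024, 2, 25), 50.0),
    ("C", date(2024, 3, 10), 30.0),
]
snapshot = date(2024, 3, 31)  # assumed reference date for measuring recency


def rfm_table(rows, as_of):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in rows:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (as_of - day).days
        # Recency is the gap to the most recent purchase, so keep the minimum.
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table


rfm = rfm_table(transactions, snapshot)
for customer, features in sorted(rfm.items()):
    print(customer, features)
```

A clustering algorithm would take these tuples as its input. Nothing in them says which segment a customer belongs to, which is what separates this task from the other three answers.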

In [112]:
import pandas as pd 

telco = pd.read_csv('datasets/telco.csv')
# Replace blank TotalCharges entries (a single space) with 0
telco.loc[telco['TotalCharges'] == " ", 'TotalCharges'] = 0

telco.loc[:, 'TotalCharges'] = telco.loc[:, 'TotalCharges'].astype(float)
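
A vectorized alternative to the cleaning step above is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a toy column with made-up values rather than on `datasets/telco.csv`. It assigns the result back with `toy["TotalCharges"] = ...`, which replaces the column. The dtype listing printed later in this notebook still shows `TotalCharges` as `object`, which suggests that the `.loc[:, ...]` assignment above kept the original dtype in the pandas version used here.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the values are made up.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" turns unparseable strings such as " " into NaN;
# fillna(0) then mirrors the replace-with-zero choice made above.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce").fillna(0)

print(toy["TotalCharges"].dtype)
print(toy["TotalCharges"].tolist())
```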

In [113]:
# exercise 01

"""
Supervised vs. unsupervised learning

Great work! You now know a lot about the differences between supervised and unsupervised learning. For this exercise, a telecom churn dataset named telco has been loaded for you. The last column called Churn defines whether or not a specific customer has churned. You will explore this dataset and determine whether it fits the supervised or unsupervised data format.
"""

# Instructions

"""

    Print the first 5 rows of the telco dataset by printing its header.
---
Question

As you learned in the video exercise, there are differences in required data formats. This dataset is designed to predict churn with a supervised model. Which of these columns should NOT be included in an unsupervised learning model?
Possible answers:
    
    tenure
    
    StreamingTV
    
    Churn {Answer}
    
    PaymentMethod
"""

# solution

print(telco.head())

#----------------------------------#

# Conclusion

"""
That's correct! This dataset was built for churn prediction and the last column is the target variable that is not used in unsupervised learning models.
"""

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

In [114]:
# exercise 02

"""
Investigate the data

Great work so far! Now you know the key techniques to explore and prepare datasets for supervised machine learning models. You will now test your knowledge in practice. In this exercise, you will explore the key characteristics of the telecom churn dataset. You should run each line separately before submitting the assignment so you get valuable information about the dataset. The pandas module has been loaded for you as pd.

The raw telecom churn dataset telco_raw has been loaded for you as a pandas DataFrame. You can familiarize yourself with the dataset by exploring it in the console.
"""

# Instructions

"""

    Print the data types of telco_raw.

    Print the header of telco_raw.

    Print the number of unique values in each telco_raw column.

"""

# solution

# Print the data types of telco_raw dataset
print(telco.dtypes)

# Print the header of telco_raw dataset
print(telco.head())

# Print the number of unique values in each telco_raw column
print(telco.nunique())

#----------------------------------#

# Conclusion

"""
Great! You have explored the dataset and now know key elements of its structure.
"""

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No  

In [115]:
telco_raw = telco.copy()

In [116]:
# exercise 03

"""
Separate numerical and categorical columns

In the last exercise, you explored the dataset characteristics, and you are now ready to do some data pre-processing. You will separate categorical and numerical variables from the telco_raw DataFrame using a customized unique value count threshold. The pandas module has been loaded for you as pd.

The raw telecom churn dataset telco_raw has been loaded for you as a pandas DataFrame. You can familiarize yourself with the dataset by exploring it in the console.
"""

# Instructions

"""

    Store customerID and Churn column names.

    Assign to categorical the column names that have fewer than 5 unique values.

    Remove target from the list.

    Assign to numerical all column names that are not in custid, target, or categorical.

"""

# solution

# Store customerID and Churn column names
custid = ['customerID']
target = ['Churn']

# Store categorical column names
unique_counts = telco_raw.nunique()
categorical = unique_counts[unique_counts < 5].index.tolist()

# Remove target from the list of categorical variables
categorical.remove(target[0])

# Store numerical column names
numerical = [x for x in telco_raw.columns if x not in custid + target + categorical]

#----------------------------------#

# Conclusion

"""
Fantastic! You have separated the categorical and numerical columns and are ready to transform them!
"""

In [117]:
from sklearn.preprocessing import StandardScaler
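
The threshold rule from exercise 03 can be tried on a toy frame to see which columns land in each list. The column values below are invented, and only a few of the telco columns are mimicked.

```python
import pandas as pd

# Toy frame with invented values; column names follow the telco dataset.
toy = pd.DataFrame({
    "customerID": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "SeniorCitizen": [0, 0, 1, 0, 1, 0],
    "tenure": [1, 34, 2, 45, 8, 22],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

custid, target = ["customerID"], ["Churn"]

# Columns with fewer than 5 distinct values count as categorical,
# except the target, which is kept out of the feature lists.
counts = toy.nunique()
categorical = [c for c in counts[counts < 5].index if c not in target]

# Everything that is not an ID, the target, or categorical is numerical.
numerical = [c for c in toy.columns if c not in custid + target + categorical]

print(categorical)
print(numerical)
```

In this toy frame `SeniorCitizen` is stored as an integer but has only two distinct values, so the rule treats it as categorical. The threshold of 5 is a heuristic: a genuinely numeric column with very few distinct values would be classified the same way.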

In [118]:
# exercise 04

"""
Encode categorical and scale numerical variables

In this final step, you will perform one-hot encoding on the categorical variables and then scale the numerical columns. The pandas library has been loaded for you as pd, as well as the StandardScaler class from the sklearn.preprocessing module.

The raw telecom churn dataset telco_raw has been loaded for you as a pandas DataFrame, as well as the lists custid, target, categorical, and numerical with column names you have created in the previous exercise. You can familiarize yourself with the dataset by exploring it in the console.
"""

# Instructions

"""

    Perform one-hot encoding on the categorical variables.

    Initialize a StandardScaler instance.

    Fit and transform the scaler on the numerical columns.

    Build a DataFrame from scaled_numerical.

"""

# solution

# Perform one-hot encoding on categorical variables
telco_raw = pd.get_dummies(data = telco_raw, columns = categorical, drop_first=True)

# Initialize StandardScaler instance
scaler = StandardScaler()

# Fit and transform the scaler on numerical columns
scaled_numerical = scaler.fit_transform(telco_raw[numerical])

# Build a DataFrame from scaled_numerical
scaled_numerical = pd.DataFrame(scaled_numerical, columns=numerical)

#----------------------------------#

# Conclusion

"""
Fantastic! Great work in one-hot encoding categorical variables and scaling the numerical ones!
"""

In [119]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
telco_raw.replace({"Churn": {'No':0, "Yes":1}}, inplace=True)
X = telco_raw.drop(columns=['Churn', 'customerID'])
Y = telco_raw.loc[:,'Churn']
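
For intuition about exercise 04, both transformations can be written out by hand. This is a plain-Python sketch, not how pandas or scikit-learn implement them. It assumes that `drop_first` removes the first category in sorted order and that standardization divides by the population standard deviation, which is how the defaults of `get_dummies` and `StandardScaler` are understood to behave. The helper names and toy values are hypothetical.

```python
from statistics import fmean, pstdev


def one_hot_drop_first(name, values):
    """One 0/1 indicator column per category, skipping the first (sorted) one."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped level is implied when all indicators are 0
    return {f"{name}_{c}": [int(v == c) for v in values] for c in kept}


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


# Toy columns with invented values.
contract = ["Month-to-month", "One year", "Two year", "Month-to-month"]
tenure = [10, 10, 30, 30]

encoded = one_hot_drop_first("Contract", contract)
scaled = standardize(tenure)

print(encoded)
print(scaled)
```

In this notebook `X` is built from `telco_raw`, so the tree in the following exercises is trained on the unscaled numerical columns. Tree splits do not depend on feature scale, so that does not change those exercises.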


In [120]:
# exercise 05

"""
Split data to training and testing

You are now ready to build an end-to-end machine learning model by following a few simple steps! You will explore modeling nuances in much more detail in the next chapters, but for now you will practice and understand the key steps.

The independent features have been loaded for you as a pandas DataFrame named X, and the dependent values as a pandas Series named Y.

Also, the train_test_split function has been loaded from the sklearn library. You will now create training and testing datasets, and then make sure the data was correctly split.
"""

# Instructions

"""

    Split X and Y into train and test sets with 25% of the data split into testing.

    Ensure that the training dataset has only 75% of original data.

    Ensure that the testing dataset has only 25% of original data.

"""

# solution

# Split X and Y into training and testing datasets
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.25)

# Ensure training dataset has only 75% of original X data
print(train_X.shape[0] / X.shape[0])

# Ensure testing dataset has only 25% of original X data
print(test_X.shape[0] / X.shape[0])

#----------------------------------#

# Conclusion

"""
Good job! You have successfully split the data into training and testing sets, and are now ready to build a machine learning model on them!
"""

0.7499645037626012
0.25003549623739885
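
The printed fractions are close to, but not exactly, 0.75 and 0.25 because the row count is not divisible by four. The sketch below reproduces them under two assumptions: that the dataset has 7,043 rows (a count consistent with the output above, though not stated in the exercise) and that a fractional `test_size` is rounded up to a whole number of rows.

```python
import math


def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_rows = 7043  # assumption: row count consistent with the printed fractions
n_train, n_test = split_sizes(n_rows, 0.25)

print(n_train, n_test)
print(n_train / n_rows, n_test / n_rows)
```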


In [127]:
# exercise 06

"""
Fit a decision tree

Now, you will take a stab at building a decision tree model. A decision tree is a set of machine-learned if-else rules that decide, in the telecom churn case, whether a customer will churn. Here's an example decision tree graph built on the famous Titanic survival dataset.

The train_X, test_X, train_Y, test_Y from the previous exercise have been loaded for you. Also, the tree module and the accuracy_score function have been loaded from the sklearn library. You will now build your model and check its performance on unseen data.
"""

# Instructions

"""

    Initialize the decision tree model with max_depth set at 5.

    Fit the model on the training data, first train_X, then train_Y.

    Predict values of the testing data, or in this case test_X.

    Measure your model's performance on the testing data by comparing the actual test labels with the predicted ones.

"""

# solution

# Initialize the model with max_depth set at 5
mytree = tree.DecisionTreeClassifier(max_depth = 5)

# Fit the model on the training data
treemodel = mytree.fit(train_X, train_Y)

# Predict values on the testing data
pred_Y = treemodel.predict(test_X)

# Measure model performance on testing data
print(accuracy_score(test_Y, pred_Y))

#----------------------------------#

# Conclusion

"""
Fantastic! You have just built a decision tree predicting churn with about 80.4% accuracy on this run! The exact value varies with the random train/test split.
"""

0.8040885860306644

In [129]:
import numpy as np
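
To make the "if-else rules" description from exercise 06 concrete, here is a hand-written stand-in for a small tree, together with the accuracy calculation. The rules, thresholds, and customers are invented for illustration. A fitted `DecisionTreeClassifier` learns its own splits from the training data.

```python
def predict_churn(customer):
    """Hand-written rules standing in for a learned tree of depth 2."""
    if customer["contract"] == "Month-to-month":
        if customer["tenure"] < 12:
            return 1  # short-tenure monthly customers: predict churn
        return 0
    return 0  # longer contracts: predict no churn


def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test customers and their true churn labels.
customers = [
    {"contract": "Month-to-month", "tenure": 2},
    {"contract": "Month-to-month", "tenure": 30},
    {"contract": "Two year", "tenure": 5},
    {"contract": "Month-to-month", "tenure": 8},
]
actual = [1, 0, 0, 0]

predicted = [predict_churn(c) for c in customers]
print(predicted)
print(accuracy(actual, predicted))
```

The last customer is predicted to churn but did not, so three of four predictions match.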

In [130]:
# exercise 07

"""
Predict churn with decision tree

Now you will build on the skills you acquired in the earlier exercise, and build a more complex decision tree with additional parameters to predict customer churn. You will dive deep into the churn prediction problem in the next chapter. Here you will run the decision tree classifier again on your training data, predict churn on unseen (test) data, and assess model accuracy on both datasets.

The tree module from the sklearn library has been loaded for you, as well as the accuracy_score function from sklearn.metrics. The features and target variables have also been imported as train_X, train_Y for training data, and test_X, test_Y for test data.
"""

# Instructions

"""

    Initialize a decision tree with maximum depth set to 7, using the gini criterion.

    Fit the model to the training data.

    Predict the values on the test dataset.

    Print the accuracy values for both training and test datasets.

"""

# solution

# Initialize the Decision Tree
clf = tree.DecisionTreeClassifier(max_depth=7,
                                  criterion='gini',
                                  splitter='best')

# Fit the model to the training data
clf = clf.fit(train_X, train_Y)

# Predict the values on test dataset
pred_Y = clf.predict(test_X)

# Print accuracy values
print("Training accuracy: ", np.round(clf.score(train_X, train_Y), 3)) 
print("Test accuracy: ", np.round(accuracy_score(test_Y, pred_Y), 3))

#----------------------------------#

# Conclusion

"""
Great results! With no parameter tuning you are accurate in around 80% of the cases on the test data - these are impressive results!
"""

Training accuracy:  0.817
Test accuracy:  0.806
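
The gini criterion named in exercise 07 scores a candidate split by how mixed the labels are on each side. The sketch below computes Gini impurity for a toy set of churn labels. The labels and helper names are made up, and the real classifier searches over many candidate splits rather than evaluating a single one.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split, weighting each child by its share of the rows."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy churn labels (1 = churned) before and after a hypothetical split.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [0, 0, 0, 0]

print(gini(parent))
print(weighted_gini(left, right))
```

A split is attractive when the weighted impurity of the children is lower than the impurity of the parent, as it is here.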

