# Introduction to Data Science (CS4661). Cal State Univ. LA, CS Dept.
## Dr. Mo. Porhomayoun
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------


# Data Science in Python

#### This is an introduction to some data science libraries/packages in Python. Feel free to refer to the suggested resources and documentation for more details.

----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------


# Scikit-Learn Library (sklearn):
Scikit-learn is a widely used Python machine learning library. It includes efficient implementations of various classification, regression, and clustering algorithms. It also includes many functions for data preprocessing and processing, along with a number of built-in datasets to work with.


#  
##    LOGISTIC REGRESSION and DECISION TREE CLASSIFIERS:

#### Review: The Main Steps to build (train) and use (test/predict) a predictive model in sklearn:

#### Step1: Importing the sklearn class (machine learning algorithm) that you would like to use for modeling:

In [4]:
# The following lines import the LogisticRegression and DecisionTreeClassifier classes.
# "LogisticRegression" and "DecisionTreeClassifier" are the names of the sklearn classes that perform Logistic Regression and Decision Tree based classification.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [5]:
# Importing the required packages and libraries
# we will need numpy and pandas later
import numpy as np
import pandas as pd


#### Step2: Set up the Feature Matrix and Label Vector:

In [3]:
# reading a CSV file directly from the web and storing it in a pandas DataFrame:
# "read_csv" is a pandas function that reads CSV files from the web or from a local device:

iris_df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv')


In [4]:
# checking the dataset by printing every 10th row:
iris_df[0::10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
10,5.4,3.7,1.5,0.2,setosa
20,5.4,3.4,1.7,0.2,setosa
30,4.8,3.1,1.6,0.2,setosa
40,5.0,3.5,1.3,0.3,setosa
50,7.0,3.2,4.7,1.4,versicolor
60,5.0,2.0,3.5,1.0,versicolor
70,5.9,3.2,4.8,1.8,versicolor
80,5.5,2.4,3.8,1.1,versicolor
90,5.5,2.6,4.4,1.2,versicolor


In [None]:
# Creating the Feature Matrix for iris dataset:

# create a Python list of the feature names that we would like to pick from the dataset:
feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']

# use the above list to select the features from the original DataFrame
X = iris_df[feature_cols]  

# print the first 5 rows
X.head()

In [None]:
# checking the size of Feature Matrix X:

print(X.shape)

In [None]:
# select a Series of labels (the last column) from the DataFrame
y = iris_df['species']

# checking the label vector by printing every 10th value
y[::10]

#### Step3: Defining (instantiating) an "object" from the sklearn class:

In [None]:
# "my_logreg" is instantiated as an "object" of LogisticRegression "class". 
# "my_decisiontree" is instantiated as an "object" of DecisionTreeClassifier "class". 


my_logreg = LogisticRegression()

my_decisiontree = DecisionTreeClassifier()

- Note that the name of the object is arbitrary. We usually choose a name that makes sense.
- To check the adjustable parameters and their default values, we can refer to the sklearn documentation.
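
Besides reading the documentation, the parameters can also be inspected from Python. The following is a minimal sketch using the `get_params` method that sklearn estimators provide; the variable names `default_params` and `shallow_tree` are only illustrative, and `max_depth=3` is an arbitrary value chosen for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# get_params() returns a dictionary that maps each adjustable parameter to its current value.
# For a freshly instantiated object, these are the default values.
default_params = LogisticRegression().get_params()
print(default_params)

# Parameters can be overridden at instantiation time, e.g. limiting the depth of the tree:
shallow_tree = DecisionTreeClassifier(max_depth=3)
print(shallow_tree.get_params()['max_depth'])
```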

#### Step4: Training Stage: Training a predictive model using the training dataset:

#### Method "fit" is used for many sklearn classes

In [None]:
# We can use the method "fit" of the objects "my_logreg" and "my_decisiontree" along with training dataset and labels to train the model.

my_logreg.fit(X, y)

my_decisiontree.fit(X, y)

#### Step5: Testing (Prediction) Stage: Making prediction on new observations (Testing Data) using the trained model:
Now, suppose that we have a new observation (a new data sample) with known features [6, 3, 5.9, 2.9] and an unknown label. What would be our prediction for the label of this new observation?
#### Method "predict" is used for many sklearn classes

In [None]:
# We use the method "predict" of the *trained* objects "my_logreg" and "my_decisiontree" on one or more testing data samples to perform prediction:

# Prediction for two new data samples:

X_Testing = [[6, 3, 5.9, 2.9],[3.2, 3, 1.9, 0.3]]

y_predict_lr = my_logreg.predict(X_Testing)

y_predict_dt = my_decisiontree.predict(X_Testing)

print(y_predict_lr)
print(y_predict_dt)

## Evaluating the accuracy of our classifier:

#### Now, to evaluate our model, as we learned in the previous tutorial, let's split the data into training and testing sets. Then, assuming that we do NOT know the labels of the Testing Set, we train our model on the Training Set and then test it on the Testing Set. Finally, we compare the "predicted labels" against the "actual labels" to check the accuracy.


In [None]:
# Randomly splitting the original dataset into a training set and a testing set.
# The function "train_test_split" from the "sklearn.model_selection" module performs random splitting.
# "test_size=0.3" means that 30% of the data samples are picked for the testing set, and the rest (70%) for the training set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

In [None]:
# print the size of the training set:
print(X_train.shape)
print(y_train.shape)


In [None]:
# print the size of the testing set:
print(X_test.shape)
print(y_test.shape)


### Training ONLY on the training set:

In [None]:
# Training ONLY on the training set:

my_logreg.fit(X_train, y_train)

my_decisiontree.fit(X_train, y_train)


### Testing on the testing set:

In [None]:
# Testing on the testing set:

y_predict_lr = my_logreg.predict(X_test)

y_predict_dt = my_decisiontree.predict(X_test)

# print(y_predict_lr)
# print(y_predict_dt)

### Accuracy Evaluation:


In [None]:
# We can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy.
# The function "accuracy_score" from "sklearn.metrics" performs the element-to-element comparison and returns the
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

score_lr = accuracy_score(y_test, y_predict_lr)
score_dt = accuracy_score(y_test, y_predict_dt)

print(score_lr)
print(score_dt)
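
To make the definition concrete, accuracy can be computed by hand as the fraction of positions where the predicted label matches the actual label. Below is a small sketch in plain Python; the two label lists are made-up values for illustration, not results from the iris models.

```python
# Hypothetical actual and predicted labels (illustration only):
y_actual    = ['setosa', 'versicolor', 'virginica', 'setosa', 'versicolor']
y_predicted = ['setosa', 'versicolor', 'versicolor', 'setosa', 'versicolor']

# Element-to-element comparison: count the positions where both labels agree
num_correct = sum(1 for a, p in zip(y_actual, y_predicted) if a == p)

# Accuracy is the fraction of correct predictions
manual_accuracy = num_correct / len(y_actual)
print(manual_accuracy)   # 4 correct out of 5 -> 0.8
```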


#   
#    LINEAR REGRESSION:

In [None]:
# The following line will import LinearRegression "Class"

from sklearn.linear_model import LinearRegression

#### In this example, we download and work with a dataset (Advertising.csv) from the Textbook "An Introduction to Statistical Learning" 

##### About Advertising Dataset: 
Suppose that we want to improve the sales of a particular product using advertisement. The
Advertising dataset consists of the sales of that product in 200 different
markets, along with advertising budgets for the product in each of those
markets for three different media: TV, radio, and newspaper. 
(reference: An Introduction to Statistical Learning)

An Introduction to Statistical Learning:  
http://www-bcf.usc.edu/~gareth/ISL

In [None]:
# read the dataset as a CSV file directly from the book website
Ad_data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# show the first 5 rows
Ad_data.head()

### Features:
- **TV:** The money spent on TV advertisements for the product in a given market (in 1000 dollars)
- **Radio:** The money spent on Radio advertisements for the product (in 1000 dollars)
- **Newspaper:** The money spent on Newspaper advertisements for the product (in 1000 dollars)

### Target:
- **Sales:** sales of the product (in 1000 items)

### linear regression model

General format for linear regression with multiple variables:

$y = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$


In this example:

$y = \theta_0 + \theta_1 \times TV + \theta_2 \times Radio + \theta_3 \times Newspaper$

The $\theta$ values are the **model coefficients** that will be determined in "training stage" by finding the best fitted line. Then, the fitted model will be used in testing stage to make predictions!

### Preparing Feature Matrix X and Target Vector y:

In [None]:
# Creating the Feature Matrix:

# create a Python list of the feature names that we would like to pick from the dataset:
feature_cols = ['TV', 'radio', 'newspaper']

# use the above list to select the features:
X = Ad_data[feature_cols]

# Another way to do this (notice double bracket!):
X = Ad_data[['TV', 'radio', 'newspaper']]

# check the size:
print(X.shape)

# show the first 5 rows
X.head()

In [None]:
# select the target (last column) from the DataFrame
y = Ad_data['sales']

# checking the size of target vector:
y.size

In [None]:
# Splitting the dataset into testing and training:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

### Instantiating and Training Stage:

In [None]:
# In the following line, "my_linreg" is instantiated as an "object" of LinearRegression "class". 

my_linreg = LinearRegression()

# fitting the model to the training data:
my_linreg.fit(X_train, y_train)

### Checking the Coefficients:

In [None]:
# printing Theta0 using attribute "intercept_":
print(my_linreg.intercept_)

# printing [Theta1, Theta2, Theta3] using attribute "coef_":
print(my_linreg.coef_)

### Thus, our predictive model will be: 
$$y = 3.189 + 0.045 \times TV + 0.183 \times Radio + 0.004 \times Newspaper$$

What does it mean?
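
One way to read the model: holding the other two budgets fixed, each additional 1000 dollars spent on TV advertising is associated with about 0.045 × 1000 = 45 more items sold, and similarly about 183 items for radio and 4 items for newspaper. The sketch below evaluates the rounded equation by hand for one market. The budget values are assumptions chosen for illustration and are not taken from the dataset.

```python
# Rounded coefficients taken from the equation above
theta0 = 3.189                                   # intercept
theta = {'TV': 0.045, 'radio': 0.183, 'newspaper': 0.004}

# Hypothetical advertising budgets for one market (in 1000 dollars)
budget = {'TV': 100, 'radio': 20, 'newspaper': 10}

# y = theta0 + theta1*TV + theta2*radio + theta3*newspaper
predicted_sales = theta0 + sum(theta[name] * budget[name] for name in theta)
print(round(predicted_sales, 3))   # predicted sales, in 1000 items
```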

### Testing and Prediction Stage:

In [None]:
# make predictions on the testing set
y_prediction = my_linreg.predict(X_test)

print(y_prediction)

#   
# Evaluation for Regression:
- So far we talked about evaluation for classifiers: for classifiers, we compared "the predicted labels" against "the actual labels" and calculated the accuracy as the percentage of correctly classified samples.
- In regression, the target is continuous-valued. So, we need to measure the error as the average difference between the "predicted target value" and the "actual value".
- The most popular metric to quantify this difference (error) is **Root Mean Square Error** or **RMSE**:

$$RMSE = \sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
- $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
- Notice that **RMSE** is an error. So, unlike the accuracy for classifiers, **a lower value for RMSE is better!**
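
As a small worked example of the formula, the sketch below computes RMSE step by step in plain Python. The actual and predicted values are made-up numbers for illustration.

```python
import math

# Hypothetical actual and predicted target values (illustration only)
y_actual    = [3.0, 5.0, 7.0]
y_predicted = [2.0, 5.0, 9.0]

# Squared differences (y_i - y_hat_i)^2 -> [1.0, 0.0, 4.0]
squared_errors = [(a - p) ** 2 for a, p in zip(y_actual, y_predicted)]

# Mean of the squared errors (MSE), then its square root (RMSE)
mse_by_hand = sum(squared_errors) / len(squared_errors)
rmse_by_hand = math.sqrt(mse_by_hand)
print(rmse_by_hand)   # sqrt(5/3), roughly 1.29
```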


In [None]:
from sklearn import metrics
import numpy as np

# Calculating "Mean Square Error" (MSE):
mse = metrics.mean_squared_error(y_test, y_prediction)

# Using numpy sqrt function to take the square root and calculate "Root Mean Square Error" (RMSE)
rmse = np.sqrt(mse)

print(rmse)

#   
# Cross-Validation

**We saw how to split the dataset into Training and Testing sets, fit the model on the "training set", and then predict on the "testing set" to evaluate the accuracy.**

**The problem with this method is that the results may depend on the split. In other words, changing which data samples happen to be in the testing set can change the testing accuracy! For example, if you are lucky, some easily predictable samples may happen to be located in the testing set (or vice versa!).**

**In order to get fairer results, we can repeat the splitting process several times, compute the prediction accuracy for each split, and then average the results.**

**Cross Validation repeats the splitting procedure K times in a smart way such that every data sample is used in the "testing set" one time and in the "training set" (K-1) times!**
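
The dependence on the split described above can be observed directly. The sketch below repeats a 70/30 split with several different `random_state` values and averages the accuracies. It is an illustration under assumptions: it loads the copy of the iris data bundled with sklearn (`load_iris`) instead of the CSV file used earlier, and the choice of five seeds is arbitrary. Note that this is plain repeated random splitting, not yet K-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Iris data bundled with sklearn (assumption: used here to avoid a download)
X_iris, y_iris = load_iris(return_X_y=True)

split_scores = []
for seed in range(5):
    # A different random_state puts different samples into the testing set
    X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_te, model.predict(X_te)))

print(split_scores)            # one accuracy per split; the values typically differ
print(np.mean(split_scores))   # the average is a more stable estimate
```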


## Three main steps for K-fold cross-validation
1. Split the dataset Randomly into K equal, non-overlapping sections.
2. Use one of the sections as **testing set** at a time and the union of the other (K-1) sections as the **training set**. Perform training stage, testing stage, and compute the accuracy based on the split each time. Repeat this procedure K times, so that each one of the K sections is used as **testing set** one time, and as a part of **training set** (K-1) times.
3. Calculate the average of the accuracies as the final result.

Note: Using K=10 (10-fold cross-validation) is very common and recommended in machine learning.
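
The three steps can be sketched in plain Python to show how the folds are formed. This is only an illustration of the index bookkeeping, assuming 10 samples and K=5; the helper name `k_fold_indices` is hypothetical, and sklearn's own implementation may differ in details such as shuffling and stratification.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # Step 1: random order
    folds = [indices[i::k] for i in range(k)]     # Step 1: K non-overlapping sections
    splits = []
    for i in range(k):                            # Step 2: each section is the testing set once
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits

splits = k_fold_indices(n_samples=10, k=5)
for train_idx, test_idx in splits:
    print('train:', sorted(train_idx), 'test:', sorted(test_idx))

# Step 3 would average the K accuracies obtained from training and testing on each pair.
```

Every sample index appears in exactly one testing set and in the training sets of the other K-1 rounds, which is the property described in the steps above.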

## Cross-Validation in sklearn:

In [None]:
# importing the function from the "sklearn.model_selection" module:
from sklearn.model_selection import cross_val_score

### Applying 10-fold Cross Validation for "logistic regression" classifier:


In [None]:
# Using the iris data:

feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']

# Feature Matrix:
X = iris_df[feature_cols] 

# Label Vector:
y = iris_df['species'] 

print(X.shape)
print(y.shape)

In [None]:
# Applying 10-fold cross validation with "logistic regression" classifier:

# In the following line, "my_logreg" is instantiated as an "object" of LogisticRegression "class". 
my_logreg = LogisticRegression()

accuracy_list = cross_val_score(my_logreg, X, y, cv=10, scoring='accuracy')

print(accuracy_list)

#### Each element in "accuracy_list" above is the accuracy value in one of the K rounds of cross validation. We will use the average of them as the final accuracy for our model.

#### As we saw, the function "cross_val_score" takes care of everything, including splitting the data, forming Training and Testing sets (K times), training and testing the model (K times), and evaluating and reporting the accuracy for each round!

#### Now, we only need to calculate the average of the accuracies from K rounds!

In [None]:
# use average of accuracy values as final result
accuracy_cv = accuracy_list.mean()

print(accuracy_cv)

### Applying 10-fold Cross Validation for "linear regression":


In [None]:
# Using Ad dataset:

feature_cols = ['TV', 'radio', 'newspaper']

# feature matrix:
X = Ad_data[feature_cols]

# target vector:
y = Ad_data['sales']

print(X.shape)
print(y.shape)

In [None]:
# Applying 10-fold cross validation with "linear regression":

# In the following line, "my_linreg" is instantiated as an "object" of LinearRegression "class". 
my_linreg = LinearRegression()

mse_list = cross_val_score(my_linreg, X, y, cv=10, scoring='neg_mean_squared_error')

print(mse_list)

#### As you see, we would like to check the "mean_squared_error" (mse) for linear regression.
#### Notice that with scoring='neg_mean_squared_error', "cross_val_score" returns "negative" values for "mse". sklearn scorers follow the convention that a higher score is better, so errors are reported with a negative sign. In order to calculate root mean square error (rmse), we have to make them positive, take the square root, and then find the average of them!

In [None]:
# Notice that with scoring='neg_mean_squared_error', "cross_val_score" returns "negative" values for "mse",
# because sklearn scorers follow the convention that a higher score is better.
# In order to calculate root mean square error (rmse), we have to make them positive!
mse_list_positive = -mse_list

# using numpy sqrt function to calculate rmse:
rmse_list = np.sqrt(mse_list_positive)
print(rmse_list)


In [None]:
# calculate the average RMSE as final result of cross validation:
print(rmse_list.mean())

**References:**
- Documentation of scikit-learn 
- Documentation of pandas
- An Introduction to Statistical Learning: www-bcf.usc.edu/~gareth/ISL
- Machine Learning, Andrew Ng
- Data school website: www.dataschool.io
- kaggle website: www.kaggle.com