# Class project presentation
- every student gives a 5-minute presentation of their class project
- those who do not give a presentation will get no credit for the class project at all
- presentation style is free: PowerPoint, Jupyter notebook, _[Jupyter notebook slideshow](https://github.com/damianavila/RISE)_, etc.
- up to 5 minutes of Q&A after each presentation
- the class is divided into 4 breakout rooms (groups); there are currently 27 projects
- in each group, one student is designated to record the session and to be the first chairman
- Step 1: the chairman chooses a presenter and keeps track of time for the presentation and Q&A (if there are no questions, the chairman has to ask one)
- Step 2: the presenter who has just finished their presentation and Q&A becomes the new chairman
- go back to Step 1 until everybody has finished their presentation
- the student who recorded the presentations will send the recordings to the teacher after class
- the teacher and the TA will rotate through the groups and listen to presentations

# Data Science Tutorial

- an example of how to use Python for Data Science

- following _[A Complete Python Tutorial to Learn Data Science from Scratch](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/)_

- details on packages (pandas, matplotlib, etc) and related code will be discussed in the upcoming classes

- today's plan


#### Table of contents

1. Basics of Python for Data Analysis
        Why learn Python for data analysis?
        Python 2.7 vs. 3.4
        How to install Python?
        Running a few simple programs in Python
   
   
2. Python libraries and data structures
        Python Data Structures (lists, strings, tuples, dictionaries)
        Python Iteration and Conditional Constructs
        Python Libraries


3. Exploratory analysis in Python using Pandas
        Introduction to series and dataframes
        Analytics Vidhya dataset - Loan Prediction Problem


4. Data Munging in Python using Pandas


5. **Building a Predictive Model in Python**
        Logistic Regression
        Decision Tree
        Random Forest


## Building a Predictive Model in Python

<img src="scikit.png" alt="Drawing" style="width: 250px;"/>

- the data is now ready for modeling
- we use scikit-learn (sklearn) to create a predictive model on our data set
- sklearn requires all inputs to be numeric, so we convert all our categorical variables into numeric ones by encoding the categories

In [6]:
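#change categorical variables into numeric
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    #cast to str first: a column that still holds missing values (float NaN) next to
    #strings otherwise raises "TypeError: Encoders require their input to be
    #uniformly strings or numbers. Got ['float', 'str']"
    #(NaN then becomes its own category 'nan'; fill missing values beforehand to avoid that)
    df[i] = le.fit_transform(df[i].astype(str))
df.dtypes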
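
The `TypeError` about mixed `float` and `str` input typically appears when a column still contains missing values: pandas stores them as float `NaN`, so a column of strings ends up with two types. Below is a minimal sketch on a made-up `Gender` column. The toy data and the choice to fill the gap with the most frequent value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with one missing value (NaN is a float)
toy = pd.DataFrame({"Gender": ["Male", "Female", np.nan, "Male"]})

# Fill the gap with the most frequent category before encoding
most_frequent = toy["Gender"].mode()[0]
filled = toy["Gender"].fillna(most_frequent)

le = LabelEncoder()
codes = le.fit_transform(filled)

# classes_ holds the sorted categories; a category's position is its code
print(list(le.classes_))
print(list(codes))
```

Filling missing values first keeps the encoded column limited to the real categories, whereas casting to `str` would turn `NaN` into an extra category.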

In [None]:
df.head()

In [None]:
#Import models from the scikit-learn module:
import numpy as np   #needed for np.mean in classification_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   #For K-fold cross validation
#KFold splits the data set into n_splits mutually exclusive subsets; in each round one subset
#serves as the validation set and the remaining n_splits-1 subsets form the training set,
#so n_splits rounds of training and testing yield n_splits results
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on training set:
  predictions = model.predict(data[predictors])
  
  #Print accuracy
  accuracy = metrics.accuracy_score(predictions,data[outcome])
  print ("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(n_splits=5)
  error = []
  for train, test in kf.split(data):
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])
    
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record the accuracy score from each cross-validation run (the list is named error but holds scores)
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
 
  print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

  #Fit the model again so that it can be referred to outside the function:
  model.fit(data[predictors],data[outcome]) 
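
To make the k-fold loop in `classification_model` more concrete, here is a small sketch on synthetic data (an assumption standing in for the loan data set). It shows the fold sizes that `KFold` produces and checks that the hand-written loop gives the same scores as scikit-learn's `cross_val_score` helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the loan data set (an assumption)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

kf = KFold(n_splits=5)

# Each of the 5 folds serves once as the test set
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# Manual loop, following the structure of classification_model
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# The same evaluation through the library helper
helper_scores = cross_val_score(LogisticRegression(), X, y, cv=kf)

print(fold_sizes)
print(np.mean(manual_scores), helper_scores.mean())
```

The manual loop is useful for seeing what happens in each round; the helper is shorter when only the scores are needed.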


### Logistic Regression model

- one way would be to put all the variables into the model
- this might result in overfitting (the model learns complex relations specific to the training data and does not generalize well)
- instead, set some intuitive hypotheses: the chances of getting a loan should be higher for
    
    1. Applicants having a credit history 
    2. Applicants with higher applicant and co-applicant incomes
    3. Applicants with higher education level
    4. Properties in urban areas with high growth prospects


In [None]:
#first model with Credit_History
outcome_var = 'Loan_Status'
model = LogisticRegression()
#model = LogisticRegression(solver='lbfgs')
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)

In [None]:
#We can try different combinations of variables:
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df,predictor_var,outcome_var)

- we would expect the accuracy to increase when adding variables, but this is a more challenging case
- the accuracy and cross-validation score are not affected by the less important variables
- Credit_History is dominating the model
- what to do?
    
    1. Feature engineering: derive new features from the existing information and use them as predictors. I will leave this to your creativity.
    2. Better modeling techniques. Let’s explore this next.
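
As a hedged illustration of the feature engineering option, the sketch below derives a combined income and log-transformed features on a few made-up rows. The column names `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount` are assumptions about the loan data set; `TotalIncome_log` and `LoanAmount_log` match the predictor names used in later cells.

```python
import numpy as np
import pandas as pd

# Made-up rows; the three input column names are assumptions
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 2500],
    "CoapplicantIncome": [0, 1500, 2500],
    "LoanAmount": [130, 100, 120],
})

# Combine both incomes: the total may matter more than either part alone
toy["TotalIncome"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]

# Log transforms dampen the effect of extreme values
toy["TotalIncome_log"] = np.log(toy["TotalIncome"])
toy["LoanAmount_log"] = np.log(toy["LoanAmount"])

print(toy[["TotalIncome", "TotalIncome_log", "LoanAmount_log"]])
```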

    

### Decision Tree

- another method for making a predictive model
- often provides higher accuracy than a logistic regression model, although this is not guaranteed

In [None]:
model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)

- the categorical variables have no visible impact on the model because Credit_History dominates them

In [None]:
#We can try different combinations of variables:
predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df,predictor_var,outcome_var)

- although the training accuracy went up on adding variables, the cross-validation score went down
- this is the result of the model overfitting the data
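
The gap between training accuracy and cross-validation score can be reproduced on synthetic data. The sketch below is an illustration under assumptions (a generated noisy data set and two arbitrary depth settings), not a result from the loan data: it compares a shallow tree with a tree that is allowed to grow until its leaves are pure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption); flip_y assigns a random class
# to about 20% of the samples
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)

results = {}
for depth in (2, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_score = cross_val_score(tree, X, y, cv=5).mean()
    train_acc = tree.fit(X, y).score(X, y)
    results[depth] = (train_acc, cv_score)
    print(f"max_depth={depth}: training accuracy {train_acc:.3f}, "
          f"cross-validation score {cv_score:.3f}")
```

The unrestricted tree memorizes the training rows, including the noisy labels, so its training accuracy says little about how it performs on unseen data.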

### Random Forest

- another method for making a predictive model
- an advantage of Random Forest is that it can work with all the features and returns a feature importance matrix, which can be used to select features

In [None]:
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
        'LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)

- the accuracy is 100% for the training set
- the ultimate case of overfitting
- can be resolved in two ways
    
    1. Reducing the number of predictors
    2. Tuning the model parameters

- first let's see the feature importance matrix from which we’ll take the most important features

In [None]:
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
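
To see how feature importances behave, here is a small sketch on synthetic data where, by construction, only one column determines the label. The data and the column names `signal` and `noise_*` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data (an assumption): only "signal" determines the label
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["signal", "noise_1", "noise_2", "noise_3"])
y = (X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized, so they add up to 1
featimp = pd.Series(forest.feature_importances_, index=X.columns)
featimp = featimp.sort_values(ascending=False)
print(featimp)
```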

- let’s use the top 5 variables for creating a model
- also modify the parameters of random forest model a little bit

In [None]:
model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Property_Area']
classification_model(model, df,predictor_var,outcome_var)

- although the training accuracy dropped, the cross-validation score improved
- this shows that the model is generalizing better
- random forest models are not exactly repeatable (different runs will result in slight variations because of randomization)
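
If repeatable results are needed, the randomization can be pinned with the `random_state` parameter of `RandomForestClassifier`. A minimal sketch on synthetic data (the data set and the helper name `fit_forest` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the loan data set (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_forest(seed):
    """Fit a small forest; `seed` is passed on as random_state."""
    forest = RandomForestClassifier(n_estimators=25, random_state=seed)
    return forest.fit(X, y)

# Two independent fits with the same seed
first = fit_forest(42).predict_proba(X)
second = fit_forest(42).predict_proba(X)

same_seed_identical = np.allclose(first, second)
print(same_seed_identical)
```

Leaving `random_state` unset keeps the default behavior described above, where each run draws a different random sequence.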

### Closure

- even after some basic parameter tuning of the random forest, we have reached a cross-validation accuracy only slightly better than that of the original logistic regression model
- this exercise teaches us some interesting lessons
    
    1. Using a more sophisticated model does not guarantee better results.
    2. Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so increases the risk of overfitting and makes your models less interpretable.
    3. Feature engineering is the key to success. Everyone can use an XGBoost model, but the real art and creativity lie in enhancing your features to better suit the model.
