# THE LANGUAGE of LIFE EXPECTANCY: 

#### A Natural Language Processing Evaluation of GitHub Repository Content Programming Language

---

**Natural Language Processing Project & Final Report Created By:**  

Chris Teceno, Rachel Robbins-Mayhill, Kristofer Rivera     
  Codeup     **|**     Innis Cohort     **|**     May 2022  

<img src='languages.png' width="1500" height="700" align="center"/>

## Project Goal
The goal of this project is to build a Natural Language Processing (NLP) model that can predict the programming language of projects within specified GitHub repositories, given the text of a README.md file. 

## Project Description

This project began by scraping README files from GitHub repositories focused on life expectancy projects. The 130 most-starred life expectancy repositories were used as the documents within the corpus for this NLP project. 

After acquiring and preparing the corpus, our team explored it with natural language processing methods such as word clouds, bigrams, and trigrams. We then used multiclass classification methods to build several machine learning models. The end goal was an NLP model that accurately predicts the programming language used in a GitHub repository from the words and word combinations found in its README file.
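
The bigram and trigram exploration mentioned above can be sketched in a few lines. This is an illustration only, not the team's exploration code: the helper name `ngram_counts`, the simple regex tokenizer, and the sample sentence are all assumptions.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams (runs of n consecutive words) in a piece of text."""
    # Hypothetical tokenizer: lowercase, keep letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Zip n staggered copies of the word list to form the n-grams.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Made-up README text, for illustration only.
readme_demo = "life expectancy data analysis of life expectancy by country"

bigrams = ngram_counts(readme_demo, 2)
trigrams = ngram_counts(readme_demo, 3)

print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Counting n-grams per programming language, rather than over the whole corpus, is one way to see which word combinations distinguish one language's READMEs from another's.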

## Initial Thoughts & Hypothesis: 



## Initial Questions:
Data-Focused Questions





## EXECUTIVE SUMMARY

===============================================================================================================================================

## I. ACQUIRE

### Note about imports:

Imports for this project are added in the sections in which they are required.

In [2]:
# import for acquisition
import os
import json
import requests
import wrangle
from env import github_token, github_username

# import for data manipulation
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, cast

# import to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
# acquire data from .json saved and processed using functions found in wrangle.py
df = pd.read_json("data.json")
df.head()

| Unnamed: 0 | repo | language | readme_contents |
|---|---|---|---|
| 0 | mcastrolab/Brazil-Covid19-e0-change | R | `# Reduction in life expectancy in Brazil after...` |
| 1 | jschoeley/de0anim | R | `# Animated annual changes in life-expectancy\n...` |
| 2 | sychi77/Thoracic_Surgery_Patient_Survival | Jupyter Notebook | `# Thoracic Surgery for Lung Cancer Data Set\n ...` |
| 3 | ashtad63/HackerRank-Data-Scientist-Hiring-Test | Jupyter Notebook | `# HackerRank Data Scientist Hiring Test: Predi...` |
| 4 | OxfordDemSci/ex2020 | R | `<p align="center">\n <img src="https://github...` |


In [4]:
# obtain number of columns and rows for original dataframe
df.shape

(166, 3)

## The Original DataFrame Size: 166 rows, or observations, and 3 columns.

=================================================================================================================================================================================================================================

## II. PREPARE

After data acquisition, the table was analyzed and cleaned to support exploration and to resolve unclear variable names. The preparation of this data can be replicated using the ____________  function saved within the wrangle.py file inside the 'NLP-Project' repository on GitHub. The function takes in the original data.json dataframe and returns it with the changes noted below.

**Steps Taken to Clean & Prepare Data:**

- Delete the "Unnamed" index column
- Rename columns for clarity, using lowercase names for easier coding during exploration
- Drop missing values (29_731 in monthly_income and 3_924 in quantity_dep) to prevent impediments in exploration and modeling 
- Create categorical columns for binning age and dependents in order to visualize and identify relationships more easily in exploration
- Manually scale monthly income to include only those with monthly income below \\$15,000 to eliminate outliers

**Note on Missing Value Handling:**
The missing value removal equated to removing _____ observations, which was about ___\% of the data set. It still left a substantial number of observations, above 1100. If given more time with the data, it is recommended to _______.

---

### Results of Data Preparation

In [None]:
# apply the data preparation observations and tasks to clean the data using the wrangle_client function found in the wrangle.py
df = wrangle.___________(df)
# view first few rows of dataframe
df.head()

In [None]:
# obtain the number of rows and columns for the updated/cleaned dataframe. 
df.shape

## Prepared DataFrame Size: 120,269 rows, 11 columns

---

### PREPARE - SPLIT

After preparing the data, it was split into 3 samples (train, validate, and test) using:

- Random State: 42
- Test = 20% of the original dataset
- The remaining 80% of the dataset is divided between validate and train
    - Validate (.30*.80) = 24% of the original dataset
    - Train (.70*.80) = 56% of the original dataset
    
The split of this data can be replicated using the split_data function saved within the wrangle.py file inside the 'NLP-Project' repository on GitHub, located [here](https://github.com/Two-Guys-and-a-Gal/NLP-Project).

In [None]:
# split the data into train, validate, and test using the split_data function found in the wrangle.py
train, validate, test = wrangle.split_data(df)
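
The 20% / 24% / 56% proportions listed above can be reproduced with two successive calls to scikit-learn's `train_test_split`. The sketch below is not the actual `split_data` function from wrangle.py: the name `split_data_sketch`, the lack of stratification, and the toy dataframe are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, seed=42):
    """Hypothetical stand-in for wrangle.split_data (assumed behavior)."""
    # First cut: hold out 20% of the original data as the test set.
    train_validate, test = train_test_split(df, test_size=0.20, random_state=seed)
    # Second cut: 30% of the remaining 80% becomes validate (24% overall),
    # which leaves 70% of the remaining 80% as train (56% overall).
    train, validate = train_test_split(
        train_validate, test_size=0.30, random_state=seed
    )
    return train, validate, test


# Toy frame with 100 rows so the percentages are easy to read off.
demo = pd.DataFrame(
    {
        "readme_contents": [f"readme text {i}" for i in range(100)],
        "language": ["R", "Python"] * 50,
    }
)

train_demo, validate_demo, test_demo = split_data_sketch(demo)
print(len(train_demo), len(validate_demo), len(test_demo))
```

With a small corpus and several target classes, passing `stratify=df.language` to each call is a common option for keeping class proportions similar across the three samples. Whether the project's own function does this is not stated here.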

=================================================================================================================================================================================================================================

## III. EXPLORE

After acquiring and preparing the data, exploration was conducted. All univariate exploration was completed on the entire cleaned dataset in the workbook for this project. For the purpose of the final report, only the target variable will be displayed in order to reduce noise and provide focused context for the project. Following univariate exploration, the data was split into train, validate, and test samples, where only the train set was used for bivariate and multivariate exploration to prevent data leakage.

In [None]:
# import for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter

---

### UNIVARIATE EXPLORATION

#### UNIVARIATE EXPLORATION of TARGET VARIABLE

In [None]:
# obtain counts of target variable and display them through print statement
print(df.language.value_counts())


# plot figure size
plt.figure(figsize=(15,9))
# define variable to plot
y = df.language.value_counts()
# format title and font
plt.title('Programming Language', fontsize=20)
# create pie chart, labeling each slice with its language
plt.pie(y, labels=y.index)
plt.show() 

# obtain percentage breakdown of target variable and round it to the nearest thousandth
print("Percent of Repositories by Programming Language")
df.language.value_counts(normalize=True).round(3)

**OBSERVATIONS:** 


#### UNIVARIATE EXPLORATION SUMMARY:

---

### EXPLORATION

---

### EXPLORATION SUMMARY

=================================================================================================================================================================================================================================

## IV. MODEL

In [5]:
# Import for modeling
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

### MODELING CONFUSION MATRIX
- True Positive: number of occurrences where serious delinquency is true and serious delinquency is predicted true.
    - protects borrower and institution from granting credit when high-risk (Optimize)
- True Negative: number of occurrences where serious delinquency is false and serious delinquency is predicted false.
    - allows those who do not have serious delinquency to access credit
- False Positive: number of occurrences where serious delinquency is false and serious delinquency is predicted true.
    - prevents non-at-risk borrowers from receiving credit, which could impact the borrower and institution negatively, but the institution less so
- False Negative: number of occurrences where serious delinquency is true and serious delinquency is predicted false.
    - allows at-risk borrowers access to credit; this could be the most costly outcome to the borrower and the institution (Optimize)

Because serious delinquency is a yes or no (boolean) value, classification machine learning algorithms were fit to the training data and the models were evaluated on validate data. The best model was selected using accuracy, but also with an eye for recall. In other words, the model was chosen to maximize true positives (serious delinquency correctly predicted) and to minimize false negatives (serious delinquency that was predicted as no serious delinquency).
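
To make the relationship between the four confusion-matrix outcomes, accuracy, and recall concrete, here is a small pure-Python sketch. The function names and the example labels are hypothetical and invented for illustration; they are not results from this project.

```python
def confusion_counts(actual, predicted, positive=True):
    """Tally true/false positives and negatives for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    # Share of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, tn, fp, fn):
    # Share of actual positives the model caught: TP / (TP + FN).
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels, for illustration only.
actual_demo = [True, True, True, False, False, False, False, False]
predicted_demo = [True, False, True, False, False, True, False, False]

counts = confusion_counts(actual_demo, predicted_demo)
print("TP, TN, FP, FN:", counts)
print(f"accuracy: {accuracy(*counts):.2f}")
print(f"recall:   {recall(*counts):.2f}")
```

Because recall only improves when false negatives fall, watching it alongside accuracy helps flag a model that scores well simply by predicting the majority class.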

### MODEL - SCALE
As previously mentioned, scaling was done manually to monthly income inside the prep function. Monthly income was scaled to include only observations below \\$15,000 to be more in line with the typical borrower. By including only the observations noted, 95\% of the data was still retained. No other scaling was conducted at this time.

### Set X & y

In [None]:
# create X & y version of train, where y is a series with just the target variable and X are all the features. 
X_train = train.drop(columns=['serious_delinquency','quantity_90_days_pd', 'age_bins', 'quantity_dependents_bins'])
y_train = train.serious_delinquency

X_validate = validate.drop(columns=['serious_delinquency', 'quantity_90_days_pd', 'age_bins', 'quantity_dependents_bins'])
y_validate = validate.serious_delinquency

X_test = test.drop(columns=['serious_delinquency', 'quantity_90_days_pd', 'age_bins', 'quantity_dependents_bins'])
y_test = test.serious_delinquency
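
When the features are README text rather than numeric columns, the text has to be converted to numbers before the classifiers imported above can be fit. The sketch below assumes TF-IDF features over unigrams and bigrams, a technique this report does not confirm was used. The `readme_contents` and `language` column names come from the acquired data preview; the `_demo` frames and their contents are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the train and validate samples (invented text).
train_demo = pd.DataFrame(
    {
        "readme_contents": [
            "install the package with pip and import pandas",
            "library ggplot2 and dplyr are required to run the script",
            "run the notebook cells to reproduce the life expectancy plots",
        ],
        "language": ["Python", "R", "Jupyter Notebook"],
    }
)
validate_demo = pd.DataFrame(
    {
        "readme_contents": ["import pandas and run the script"],
        "language": ["Python"],
    }
)

# Assumption: TF-IDF weighting over single words and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2))

# Fit the vocabulary on train only, then reuse it on validate,
# so nothing about the validate text leaks into the features.
X_train_demo = tfidf.fit_transform(train_demo.readme_contents)
X_validate_demo = tfidf.transform(validate_demo.readme_contents)

y_train_demo = train_demo.language
y_validate_demo = validate_demo.language

print(X_train_demo.shape, X_validate_demo.shape)
```

The resulting sparse matrices can be passed directly to `DecisionTreeClassifier`, `RandomForestClassifier`, or `LogisticRegression` in place of a numeric feature dataframe.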

### Set Baseline

A baseline prediction was set by predicting all repositories will have ____________ as the programming language. We will evaluate the accuracy of our models in comparison to that baseline.

In [None]:
# obtain the mode (most common class) for the target; mode() returns a Series, so take its first value
baseline = y_train.mode()[0]

# produce boolean array with True where the baseline prediction matches the real data
matches_baseline_prediction = (y_train == baseline)

baseline_accuracy = matches_baseline_prediction.mean()

print(f'Baseline Accuracy: {baseline_accuracy:.2%}')

#### The 3 models built were 
- Decision Tree
- Random Forest
- Logistic Regression

The models were run with many trials, adjusting parameters and algorithms to find the best performing model.  

- None of these models appeared to be overfit.
- None of the models performed significantly better than baseline.
- The best-performing Random Forest model had min_samples_leaf of 12 and max_depth of 12, with train accuracy of 94% and validate accuracy of 93%, only 1% better than baseline on validate. It was then applied to the unseen test data.

### MODEL - DECISION TREE

In [None]:
#Create the object
clf1 = DecisionTreeClassifier(max_depth=2, random_state=123)
# Fit the model
clf1 = clf1.fit(X_train, y_train)

In [None]:
# make predictions
y_pred = clf1.predict(X_train)

# estimate probability of each class, labeling columns with the classes the model learned
y_pred_proba = clf1.predict_proba(X_train)
pd.DataFrame(y_pred_proba, columns=clf1.classes_).head()

### Evaluate Model

In [None]:
# obtain accuracy of model
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf1.score(X_train, y_train)))

In [None]:
# obtain classification report to look at model results
print(classification_report(y_train, y_pred))

### Evaluate the Model with our Validate dataset

In [None]:
print('Accuracy of Decision Tree classifier on validate set: {:.2f}'
     .format(clf1.score(X_validate, y_validate)))

In [None]:
# Produce y_predictions that come from the X_validate
y_pred = clf1.predict(X_validate)

# Compare actual y values (from validate) to predicted y_values from the model run on X_validate
print(classification_report(y_validate, y_pred))

### Model - RANDOM FOREST

In [None]:
# Evaluate Random Forest models on train & validate set 
# by looping through different values for max_depth and min_samples_leaf hyperparameters

# create empty list for which to append metrics from each loop
scores = []
max_value = range(1,21)
# create loop for range 1-20
for i in max_value:
    # set depth & n_samples to value for current loop
    depth = i
    n_samples = i
    # define the model setting hyperparameters to values for current loop
    forest = RandomForestClassifier(max_depth=depth, min_samples_leaf=n_samples, random_state=123)
    # fit the model on train
    forest = forest.fit(X_train, y_train)
    # use the model and evaluate performance on train
    in_sample_accuracy = forest.score(X_train, y_train)
    # use the model and evaluate performance on validate
    out_of_sample_accuracy = forest.score(X_validate, y_validate)
    # create output of current loop’s hyperparameters and accuracy to append to metrics
    output = {
        'min_samples_per_leaf': n_samples,
        'max_depth': depth,
        'train_accuracy': in_sample_accuracy,
        'validate_accuracy': out_of_sample_accuracy
    }
    scores.append(output)
# convert metrics list to a dataframe for easy reading
df = pd.DataFrame(scores)
# add column to assess the difference between train & validate accuracy
df['difference'] = df.train_accuracy - df.validate_accuracy
df

The Random Forest model that performed the best on the train & validate sets had max_depth of _____ and min_samples_leaf of _____, with _____ accuracy on train and _____ accuracy on validate, so that model is isolated below in the event it is the best performing model to be applied to the test (unseen) dataset. 
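
One way to pick the "best" row from a table of scores like the one built in the loop above is to sort by validate accuracy and break ties with the smallest train/validate gap. This is a sketch of that selection rule only. The rule itself is an assumption, and the numbers in `scores_demo` are made up, not results from this project.

```python
import pandas as pd

# Invented scores with the same columns the loop produces.
scores_demo = pd.DataFrame(
    [
        {"min_samples_per_leaf": 1, "max_depth": 1,
         "train_accuracy": 0.70, "validate_accuracy": 0.68},
        {"min_samples_per_leaf": 2, "max_depth": 2,
         "train_accuracy": 0.80, "validate_accuracy": 0.75},
        {"min_samples_per_leaf": 3, "max_depth": 3,
         "train_accuracy": 0.95, "validate_accuracy": 0.75},
    ]
)
scores_demo["difference"] = (
    scores_demo.train_accuracy - scores_demo.validate_accuracy
)

# Highest validate accuracy first; among ties, prefer the smaller
# train/validate gap, since a large gap suggests overfitting.
best = scores_demo.sort_values(
    ["validate_accuracy", "difference"], ascending=[False, True]
).iloc[0]

print(best)
```

In this toy table the second and third rows tie on validate accuracy, and the rule picks the one whose train accuracy sits closer to its validate accuracy.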

In [None]:
# define the model setting hyperparameters to values for the best performing model
forest = RandomForestClassifier(max_depth=12, min_samples_leaf=12, random_state=123)

# fit the model on train
forest = forest.fit(X_train, y_train)

# use the model and evaluate performance on train
train_accuracy = forest.score(X_train, y_train)
# use the model and evaluate performance on validate
validate_accuracy = forest.score(X_validate, y_validate)

print(f'train_accuracy: {train_accuracy:.2%}')
print(f'validate_accuracy: {validate_accuracy:.2%}')

---

### Model - LOGISTIC REGRESSION

In [None]:
# Evaluate Logistic Regression models on train & validate set by looping through different values for the C hyperparameter

# create empty list for which to append metrics from each loop
metrics = []

# create loop for values in list
for c in [.001, .005, .01, .05, .1, .5, 1, 5, 10, 50, 100, 500, 1000]:
            
    # define the model setting hyperparameters to values for current loop
    logit = LogisticRegression(C=c)
    
    # fit the model on train
    logit.fit(X_train, y_train)
    
    # use the model and evaluate performance on train
    train_accuracy = logit.score(X_train, y_train)
    # use the model and evaluate performance on validate
    validate_accuracy = logit.score(X_validate, y_validate)
    
    # create output of current loop's hyperparameters and accuracy to append to metrics
    output = {
        'C': c,
        'train_accuracy': train_accuracy,
        'validate_accuracy': validate_accuracy
    }
    
    metrics.append(output)

# convert metrics list to a dataframe for easy reading
df = pd.DataFrame(metrics)
# add column to assess the difference between train & validate accuracy
df['difference'] = df.train_accuracy - df.validate_accuracy
df

The loop above evaluates each model on the validate data set for comparison. The Logistic Regression model that performed best had a C value (inverse regularization strength) of _____, with a train accuracy of _____ and validate accuracy of _____, performing ______ as baseline on unseen (validate) data.

### Best Performing Model Applied to Test Data (Unseen Data)

All of the best performing models performed on par with baseline, at _____ accuracy. The Random Forest model with max_depth of _____ and min_samples_leaf of _____ performed just slightly higher than baseline, with _____ accuracy on train and _____ accuracy on validate. Because it outperformed the alternative models, it was applied to the test set.

In [None]:
# Evaluate the best Random Forest model on the test set

# define the model setting hyperparameters to values for the best performing model
forest = RandomForestClassifier(max_depth=12, min_samples_leaf=12, random_state=123)

# fit the model on train
forest = forest.fit(X_train, y_train)

# use the model and evaluate performance on test (unseen data)
test_accuracy = forest.score(X_test, y_test)

print(f'test_accuracy: {test_accuracy:.2%}')

This model is expected to perform with 93% accuracy in the future on data it has not seen, given no major changes in the data source, which is equivalent to baseline prediction.

=========================================================================================================================================================================================

## V. CONCLUSION

The goal of this report was to identify drivers of borrower serious delinquency and to build a model that could be used to help borrowers and banking institutions make the best financial decisions.  
Through the process of data acquisition, preparation, exploration, and statistical testing, it was determined borrowers more at-risk of serious delinquency are borrowers: 

- between the ages of 20 and 40
- who have an average monthly income of \\$4900, which is about \\$800 lower than those who are not seriously delinquent  
- who have an average debt to income ratio of 16\%, 13\% lower than those who are not seriously delinquent
- who have an average revolving credit utilization rate of 315\%, over 250\% **lower** than those who are not seriously delinquent

By using machine learning modeling, predictions to identify serious delinquency were made with 93% accuracy using the best performing model, a Random Forest model with max depth and minimum leaf sample of 12. This is not significantly better than baseline, which is also at 93% accuracy.

### RECOMMENDATIONS

- USE THE IDENTIFIED DRIVERS of SERIOUS DELINQUENCY: The data shows age and revolving unsecured line utilization have the largest impact on serious delinquency. These features should be used going forward to predict serious delinquency, as well as a starting point for conducting further analysis.  

- BUILD UPON MODEL PERFORMANCE: Although the Random Forest model, with max depth and minimum leaf sample of 12, does predict serious delinquency with 93% accuracy on unseen data, the model does not perform better than baseline and it could continue to be improved upon through deeper analysis of features and their impact on serious delinquency. See next steps. 

- CONTINUE to PINPOINT FEATURES DRIVING SERIOUS DELINQUENCY: This report focused on age, monthly income, debt to income ratio, and revolving unsecured line utilization as initial contributors to serious delinquency. It is recommended further analysis and modeling be done with additional features in order to create models with improved performance. 

### NEXT STEPS

If given more time, we would like to:

- Eliminate more outliers and scale further aspects of the data to produce more accurate modeling/predictions.

- Downsample to decrease the quantity of the majority class (no serious delinquency) in order to better see the impact of features on the target.

- Identify more precise populations within the data that have serious delinquency. For instance, we would like to investigate:
    - the 144 observations that have over 98 occurrences of serious delinquency.
    - the relationship between intermediate delinquencies (30 days, 60 days) and the target.

===================================================================================================================================