# Natural Language Processing Project
### README Repository Language Classification

---
## Executive Summary

1. Scraped READMEs from GitHub repositories
2. Built models to predict whether a repository's primary language is:
    - Java
    - JavaScript  
    - Python
    - C++
3. The Gradient Boosting classifier performed best, with 76% accuracy on the test set.

In [1]:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:.0%}'.format

import seaborn as sns
import matplotlib.pyplot as plt

import main
import acquire
import prepare
import preprocessing
import model

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

from warnings import filterwarnings
filterwarnings('ignore')

## Acquire

- Manually explore GitHub in a web browser and identify the relevant HTML elements for repositories about sports, data engineering, artificial intelligence, space exploration, and biology written in JavaScript, Python, Java, and C++.
- Use the requests module to obtain the HTML from each repository.
- Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
- Extract the README text, language, watchers, stars, forks, and commits.
- Save the raw data as a JSON file.
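
The parsing step above might look roughly like the sketch below. The `parse_repo_page` helper and both CSS selectors are hypothetical and are not taken from the project's `acquire` module; GitHub's markup changes over time, so the selectors would need to be confirmed in a browser first. In practice the HTML string would come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup


def parse_repo_page(html: str) -> dict:
    """Extract README text and primary language from a repository page.

    The selectors are illustrative assumptions, not GitHub's guaranteed markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    readme = soup.select_one("article.markdown-body")
    language = soup.select_one("span.language")
    return {
        "readme": readme.get_text(" ", strip=True) if readme else None,
        "language": language.get_text(strip=True) if language else None,
    }


# Tiny stand-in for a downloaded page
sample_html = """
<html><body>
  <span class="language">Python</span>
  <article class="markdown-body"><h1>Demo</h1><p>Install with pip.</p></article>
</body></html>
"""
record = parse_repo_page(sample_html)
```

Returning `None` for a missing element keeps one malformed page from stopping the whole scrape; such rows can be dropped later during preparation.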

## Prepare

- Cleaned the `language`, `watchers`, `stars`, `forks` and `commits` columns of unnecessary text and changed data types as needed.
- Normalized, tokenized, stemmed, lemmatized, and removed stop words from the README column.
- Split into Train, Validate and Test datasets.
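
As a rough illustration of the text cleaning described above, the sketch below normalizes, tokenizes, and removes stop words using only the standard library. The function names and the short stop word list are assumptions made for this example; the project's `prepare` module may differ, and stemming and lemmatizing (typically done with a library such as NLTK) are left out.

```python
import re
import unicodedata

# Short illustrative stop word list (an assumption; a real run would
# likely use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "with"}


def basic_clean(text: str) -> str:
    """Lowercase, strip accents and non-ASCII characters, drop punctuation."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenize(text: str) -> list:
    """Split cleaned text on whitespace."""
    return text.split()


def remove_stop_words(tokens: list) -> list:
    """Keep only tokens that are not in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def prepare_readme(text: str) -> str:
    """Run the full cleaning pipeline and return a space-joined string."""
    return " ".join(remove_stop_words(tokenize(basic_clean(text))))


cleaned = prepare_readme("Installing the Café-App: run `npm install` & enjoy!")
```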

## Exploration
- Turn the `readme` column into a pandas Series and create a DataFrame with the word counts for each language
- Visualize the proportion of the most frequent words by language
- Create bigrams and trigrams
- Create word clouds to visualize the most frequent words used
- Find the most frequent named entities in the corpus
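
The word count and bigram steps above could be sketched as follows. The toy DataFrame, its column names, and the helper functions are assumptions for illustration only, not the project's actual data or code.

```python
import pandas as pd

# Toy stand-in for the prepared data; column names are assumptions.
df = pd.DataFrame({
    "language": ["Python", "Python", "Java", "JavaScript"],
    "readme": [
        "pip install package",
        "python package data",
        "maven build jar",
        "npm install package",
    ],
})


def word_counts_by_language(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a words x languages table of raw counts plus an 'all' column."""
    counts = {
        lang: pd.Series(" ".join(group["readme"]).split()).value_counts()
        for lang, group in frame.groupby("language")
    }
    table = pd.DataFrame(counts).fillna(0).astype(int)
    table["all"] = table.sum(axis=1)
    return table.sort_values("all", ascending=False)


def bigrams(text: str) -> list:
    """Pair each word with its successor within a single README."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


word_counts = word_counts_by_language(df)

# Bigram frequencies for one language, computed per document so that
# no bigram spans two different READMEs.
python_bigrams = (
    df.loc[df["language"] == "Python", "readme"]
    .apply(bigrams)
    .explode()
    .value_counts()
)
```

Dividing each language column by the `all` column gives the proportion of each word's usage by language, which is what the proportion plot would visualize.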

In [2]:
# Exploration Visual 1

In [3]:
# Exploration Visual 2

In [4]:
# Exploration Visual 3

---
## Modeling


The data is split 60-20-20 into train, validate, and test sets.
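
A 60-20-20 split can be produced with two calls to scikit-learn's `train_test_split`. The sketch below uses a toy DataFrame with assumed column names; stratifying on the target and the `random_state` value are also assumptions, since the notebook does not show how the split was made.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared data (column names are assumptions).
df = pd.DataFrame({
    "readme": [f"doc {i}" for i in range(100)],
    "language": ["C++", "Java", "JavaScript", "Python"] * 25,
})

# Hold out 20% for test, then take 25% of the remaining 80%
# (20% of the total) for validate, leaving 60% for train.
train_validate, test = train_test_split(
    df, test_size=0.20, stratify=df["language"], random_state=42
)
train, validate = train_test_split(
    train_validate,
    test_size=0.25,
    stratify=train_validate["language"],
    random_state=42,
)
```

Stratifying keeps the four languages in roughly equal proportion across the three sets, which matters when accuracy is the headline metric.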

### Preprocessing

<strong>TF-IDF</strong><br>
Term Frequency - Inverse Document Frequency

<strong>MinMaxScaler</strong>
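
A minimal sketch of the two preprocessing steps, assuming TF-IDF is applied to the README text and `MinMaxScaler` to numeric columns such as stars and forks. The example documents and numbers are invented, and the project's `preprocessing` module may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

train_docs = ["pip install package", "npm install module", "maven build jar"]
new_docs = ["pip install jar"]

# Fit on training text only, then reuse the fitted vocabulary on unseen text.
tfidf = TfidfVectorizer()  # norm='l2' is the default
X_train_text = tfidf.fit_transform(train_docs)
X_new_text = tfidf.transform(new_docs)

# With the default L2 normalization, every non-empty row has unit length.
row_norms = np.linalg.norm(X_train_text.toarray(), axis=1)

# Scale numeric repository features to [0, 1], again fitting on train only.
train_numeric = pd.DataFrame({"stars": [10, 200, 50], "forks": [1, 40, 5]})
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_numeric)
```

Fitting both transformers on the training set alone avoids leaking information from the validate and test sets into the features.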

In [5]:
X_train, y_train, X_validate, y_validate, X_test, y_test = model.model_data('data/model')

## Train

##### Baseline Accuracy

In [6]:
baseline_prediction = y_train.value_counts().nlargest(1).index[0]

baseline_accuracy = (y_train == baseline_prediction).mean()
print(f"The baseline accuracy is {baseline_accuracy:.0%}")
print(f"{baseline_prediction}")

The baseline accuracy is 26%
JavaScript


#### Classification Models

In [7]:
### Ridge Classifier

clf = RidgeClassifierCV()
clf.fit(X_train, y_train)
clf_train_acc = clf.score(X_train, y_train)

### Random Forest

tree = RandomForestClassifier()
tree.fit(X_train, y_train)
tree_train_acc = tree.score(X_train, y_train)

### Gradient Boost

ml = GradientBoostingClassifier()
ml.fit(X_train, y_train)
ml_train_acc = ml.score(X_train, y_train)

In [8]:
train_scores = model.model_scores(clf_train_acc,
                                  tree_train_acc,
                                  ml_train_acc,
                                  modeling_set='train')

In [9]:
train_scores

| | ridge | random_forest | gradient_boost |
|---|---|---|---|
| train | 100% | 100% | 100% |


## Validate

In [10]:
### Ridge Classifier
clf_val_acc = clf.score(X_validate, y_validate)

### Random Forest
tree_val_acc = tree.score(X_validate, y_validate)

### Gradient Boost
ml_val_acc = ml.score(X_validate, y_validate)

In [11]:
validate_scores = model.model_scores(clf_val_acc,
                                     tree_val_acc,
                                     ml_val_acc,
                                     modeling_set='validate')

In [12]:
validate_scores

| | ridge | random_forest | gradient_boost |
|---|---|---|---|
| validate | 74% | 72% | 78% |


## Test

In [13]:
### Gradient Boost

ml_test_acc = ml.score(X_test, y_test)

In [14]:
## Image

---
## Evaluation

Model Accuracy
- Confusion matrix
- Classification report
- ROC Curve
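
The confusion matrix and classification report appear in the cells that follow. For the ROC curve, one common approach with four classes is one-vs-rest; the sketch below shows the idea on synthetic data generated with `make_classification`, not on the project's README features, so its numbers say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic four-class problem standing in for the README features.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one probability column per class

# Macro-averaged one-vs-rest AUC across the four classes.
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")

# One ROC curve per class: treat that class as positive, the rest as negative.
y_bin = label_binarize(y_te, classes=clf.classes_)
curves = {}
for i, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    curves[label] = (fpr, tpr)
```

Each `(fpr, tpr)` pair can be passed to `plt.plot` to draw that class's curve on a shared axis.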

In [59]:
train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))
test = pd.DataFrame(dict(actual=y_test))

train['predicted'] = ml.predict(X_train)
validate['predicted'] = ml.predict(X_validate)
test['predicted'] = ml.predict(X_test)

In [60]:
print("Training Set")
print("")
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

Training Set

Accuracy: 99.72%
---
Confusion Matrix
actual      C++  Java  JavaScript  Python
predicted                                
C++          90     0           0       1
Java          0    87           0       0
JavaScript    0     0          91       0
Python        0     0           0      84
---
              precision    recall  f1-score   support

         C++       0.99      1.00      0.99        90
        Java       1.00      1.00      1.00        87
  JavaScript       1.00      1.00      1.00        91
      Python       1.00      0.99      0.99        85

    accuracy                           1.00       353
   macro avg       1.00      1.00      1.00       353
weighted avg       1.00      1.00      1.00       353



In [61]:
print("Validation Set")
print("")
print('Accuracy: {:.2%}'.format(accuracy_score(validate.actual, validate.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(validate.predicted, validate.actual))
print('---')
print(classification_report(validate.actual, validate.predicted))

Validation Set

Accuracy: 77.97%
---
Confusion Matrix
actual      C++  Java  JavaScript  Python
predicted                                
C++          22     4           5       3
Java          4    22           1       1
JavaScript    3     2          24       0
Python        1     1           1      24
---
              precision    recall  f1-score   support

         C++       0.65      0.73      0.69        30
        Java       0.79      0.76      0.77        29
  JavaScript       0.83      0.77      0.80        31
      Python       0.89      0.86      0.87        28

    accuracy                           0.78       118
   macro avg       0.79      0.78      0.78       118
weighted avg       0.79      0.78      0.78       118



In [62]:
print("Test Set")
print("")
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))

Test Set

Accuracy: 76.27%
---
Confusion Matrix
actual      C++  Java  JavaScript  Python
predicted                                
C++          22     3           3       1
Java          3    21           0       1
JavaScript    3     4          25       4
Python        2     1           3      22
---
              precision    recall  f1-score   support

         C++       0.76      0.73      0.75        30
        Java       0.84      0.72      0.78        29
  JavaScript       0.69      0.81      0.75        31
      Python       0.79      0.79      0.79        28

    accuracy                           0.76       118
   macro avg       0.77      0.76      0.76       118
weighted avg       0.77      0.76      0.76       118



## Appendix

In [63]:
# TF-IDF L2 normalization check: each row's sum of squares (the squared L2 norm) should equal 1

for i in range(3):
    print(f"Training observation {i+1}: L2 Norm --- Euclidian distance = {sum(X_train.iloc[i]**2):.0f}")

Training observation 1: L2 Norm --- Euclidian distance = 1
Training observation 2: L2 Norm --- Euclidian distance = 1
Training observation 3: L2 Norm --- Euclidian distance = 1
