<p align= "center">
<img width="300" src="https://logos-world.net/wp-content/uploads/2020/11/GitHub-Emblem.png" alt="Github Logo">
</p>


<h1 align = "center">Programming Language Detection</h1>

<h2 align = "center">By Chloe Whitaker, Jeanette Schulz, Brian Clements, and Paige Guajardo </h2>
<h4 align = "center">11 February 2022</h4>



<hr style="border:2px solid blue">

# About this Project
### GitHub Web Scraping and Natural Language Processing
Millions of developers and companies build, ship, and maintain their software on GitHub, the largest and most advanced development platform in the world. As Codeup's up-and-coming Data Scientists, we are using GitHub to practice both our web-scraping and our Natural Language Processing (NLP) skills. Focusing on repositories related to Bitcoin, our goal is to predict the programming language used in a repository based solely on its README.md file. By exploring the README text, we hope to find keywords that indicate which programming language(s) were used. We then train a classification model on that text so that it can predict the programming language of repositories it has not seen before. For our project, we focused on the five most common languages in Bitcoin repositories and labeled the rest 'Other'. The languages we try to predict are:
- JavaScript         
- Python             
- C++                 
- PHP                 
- C                   
- HTML                 
- Go                   
- Ruby                 
- Java                 
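The step of keeping the most common languages and labeling the rest 'Other' can be sketched as below. This is a minimal illustration, not the project's code: the function name `bucket_languages`, its parameters, and the toy labels are all assumptions.

```python
from collections import Counter


def bucket_languages(languages, top_n=5, other_label="Other"):
    """Keep the top_n most frequent labels; rename every other label to other_label."""
    # Repositories with no detected language (None) also fall into the other bucket.
    counts = Counter(lang for lang in languages if lang is not None)
    keep = {lang for lang, _ in counts.most_common(top_n)}
    return [lang if lang in keep else other_label for lang in languages]


# Toy labels (hypothetical), bucketed with top_n=2 to keep the example short.
labels = ["Python", "Python", "Python", "JavaScript", "JavaScript", "Rust", None]
print(bucket_languages(labels, top_n=2))
```

Collapsing rare labels keeps every class large enough to stratify on, at the cost of a catch-all class that can easily become the majority.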

### Project Goal: 

The goal is to scrape README pages from Bitcoin-related repositories on GitHub and use their text to predict each repository's programming language.
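The acquire step itself is not shown in this notebook. The sketch below is therefore an assumption about one way to fetch a README: instead of scraping HTML, it uses the GitHub REST API endpoint `GET /repos/{owner}/{repo}/readme`, which returns the file base64-encoded inside a JSON payload. The helper names `readme_api_url` and `decode_readme` are hypothetical, and the demonstration decodes a hand-built payload rather than making a live request.

```python
import base64

API_ROOT = "https://api.github.com"


def readme_api_url(repo):
    """Build the REST endpoint for a repository's README; repo is 'owner/name'."""
    return f"{API_ROOT}/repos/{repo}/readme"


def decode_readme(payload):
    """Decode the README text from the JSON payload returned by that endpoint."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


# Offline demonstration with a hand-built payload (not real API output).
sample_payload = {
    "encoding": "base64",
    "content": base64.b64encode(b"# LightningPay\nBitcoin Lightning Network payments").decode("ascii"),
}
print(readme_api_url("using-system/LightningPay"))
print(decode_readme(sample_payload))
```

In a real run the payload would come from something like `requests.get(url).json()`, ideally with an access token to stay within rate limits.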

### Project Description: 

Language Predictor is a group project in which we scrape README files from Bitcoin-related GitHub repositories. We then build a classification model that predicts the programming language of a repository using only the text in its README. 

- List of the languages whose detection is supported: 
    - JavaScript
    - Python
    - Go
    - C++
    - Java
    - TypeScript
    - HTML
    - PHP
    - C#


# Data Dictionary

| Feature         | Datatype                | Description                                                           |
|:----------------|:------------------------|:----------------------------------------------------------------------|
| repo            | 847330 non-null: object | Repository identifier in `owner/name` form                            |
| language        | 847330 non-null: object | Programming language label for the repository (the prediction target) |
| readme_contents | 847330 non-null: object | Text of the repository's README file                                  |


<hr style="border:2px solid blue">

### Imports

Here are the imports needed to run this notebook.

In [1]:
import pandas as pd
import numpy as np

# Scraping
import requests
from bs4 import BeautifulSoup

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import nltk.sentiment

from wordcloud import WordCloud
# pd.set_option('display.max_colwidth', -1)

# Regex
import re

# Time
from time import strftime

import unicodedata
import json
from pprint import pprint

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Custom Imports
# import acquire 
import prepare
import wrangle
import model
# Turn off pink boxes for demo
import warnings
warnings.filterwarnings("ignore")

--------

# Let's Get Started...

-----

## Wrangle

Use the wrangle.py helper file to acquire and prepare the GitHub README data. 
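The contents of `wrangle.py` are not shown here, so the following is only a sketch of the kind of cleaning such a helper typically performs: lowercasing, Unicode normalization, punctuation removal, and stopword removal. The function names and the tiny stopword list are assumptions, and lemmatization is left out because it needs NLTK corpus downloads.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption); NLTK's English list is far longer.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}


def basic_clean(text):
    """Lowercase, strip accents and non-ASCII characters, and drop punctuation."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Punctuation is deleted rather than replaced by a space, so URLs collapse into one token.
    return re.sub(r"[^a-z0-9\s]", "", text)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split on whitespace and drop any word found in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)


raw = "LightningPay: a Bitcoin Lightning Network payment café for the .NET world!"
print(remove_stopwords(basic_clean(raw)))
```

Deleting punctuation without inserting a space merges badge links into single tokens, which is one plausible source of strings such as `statushttpstravi...` in the `lemmatized` column.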

In [2]:
# df=acquire.make_json(cached=True)
df = pd.read_json('repo_readmes_10_feb_am.json')

In [3]:
df = wrangle.brian_quick_clean(df)
df.head()

|   | repo | language | lemmatized |
|--:|:-----|:---------|:-----------|
| 0 | using-system/LightningPay | Other | lightningpay bitcoin lightning network payment... |
| 1 | drminnaar/react-bitcoin-monitor | JavaScript | react bitcoin monitor app monitor change _bitc... |
| 2 | lbryio/lbrycrd | C++ | lbrycrd lbry blockchain build statushttpstravi... |
| 3 | ElementsProject/lightning-charge | JavaScript | lightning charge build statushttpsapitraviscio... |
| 4 | kilimchoi/cryptocurrency | Other | check httpscoinbuddycocoins track exchange sup... |


In [4]:
df.shape

(160, 3)

In [5]:
# train, validate, test, X_train, y_train, X_validate, y_validate, X_test, y_test = \
# wrangle.split_repos(df)

In [6]:
# train.shape, validate.shape, test.shape, X_train.shape, y_train.shape, X_validate.shape, y_validate.shape, X_test.shape, y_test.shape

### Steps Taken to Prepare the Data:

## Exploration

### Initial Hypotheses/Questions:

#### Initial Hypothesis: 

#### Initial Questions: 
    1. 
    2. 
    3. 
    4. 

### Explore: 

### Exploration Takeaways:

### Features to Move Forward with:

## Modeling

### Our best-performing model without undue overfitting was a Logistic Regression model

### We ran more than 10 models, split between a bag-of-words approach and a TF-IDF approach
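To make the difference between the two approaches concrete, here is a small from-scratch sketch on three made-up documents. Bag of words keeps raw term counts, while TF-IDF down-weights terms that appear in many documents. The formula used is the smoothed variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` applies by default. The documents and helper names are illustrative assumptions.

```python
import math
from collections import Counter

# Three made-up README snippets; "bitcoin" appears in every one of them.
docs = [
    "bitcoin wallet npm install",
    "bitcoin node pip install",
    "bitcoin miner cmake build",
]
tokenized = [doc.split() for doc in docs]

# Bag of words: each document becomes a mapping of term -> raw count.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(bag):
    """Weight raw counts by idf, then scale the vector to unit L2 length."""
    weights = {term: count * idf(term) for term, count in bag.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first = tfidf(bags[0])
print(bags[0]["bitcoin"], bags[0]["npm"])  # identical raw counts
print(first["bitcoin"] < first["npm"])     # TF-IDF ranks the rarer term higher
```

Because every repository in this project is about Bitcoin, words like "bitcoin" carry little signal; TF-IDF shrinks them, while bag of words treats them like any other term.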

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score

## TF-IDF Method

In [8]:
tfidf = TfidfVectorizer()
# Note: the vectorizer is fit on every row before splitting, so validate/test vocabulary leaks into training
X = tfidf.fit_transform(df.lemmatized)
y = df.language

# Hold out 20% of the rows as the validate set, then 20% of the remainder as the test set
X_train, X_validate, y_train, y_validate = train_test_split(X, y, stratify=y, test_size=.2)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train, test_size=.2)
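The sizes produced by this two-stage split can be worked out by hand. The sketch below assumes that `train_test_split` rounds the held-out share up to a whole row, which agrees with the supports of 102 and 32 shown in the classification reports; the helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out part up to a whole row."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_rest, n_validate = split_sizes(160, 0.2)   # first split: 160 rows, per df.shape
n_train, n_test = split_sizes(n_rest, 0.2)   # second split of the remaining rows
print(n_train, n_validate, n_test)
```

With only 160 repositories, the validate and test sets hold a handful of rows per class, so single-run accuracy figures are noisy; neither split sets `random_state`, so they also change from run to run.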

In [9]:
from sklearn.linear_model import LogisticRegression
train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))

lm = LogisticRegression().fit(X_train, y_train)

train['lr_predicted_tdidf'] = lm.predict(X_train)
validate['lr_predicted_tdidf'] = lm.predict(X_validate)
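One possible variation, shown here as a sketch rather than as what this notebook ran, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The vectorizer then learns its vocabulary and IDF weights from the training rows only. The toy texts and labels below are stand-ins for `df.lemmatized` and `df.language`, and `random_state=123` is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for df.lemmatized and df.language; not the project's data.
texts = [
    "npm install react bitcoin ticker", "node express bitcoin api npm",
    "npm webpack bitcoin wallet ui", "react component bitcoin price npm",
    "pip install bitcoin python client", "python flask bitcoin api pip",
    "pip pandas bitcoin price analysis", "python script bitcoin miner pip",
]
labels = ["JavaScript"] * 4 + ["Python"] * 4

# Split the raw text first so the vectorizer never sees held-out documents.
text_train, text_validate, y_train, y_validate = train_test_split(
    texts, labels, stratify=labels, test_size=0.25, random_state=123
)

# The pipeline fits TF-IDF on the training text only, then fits the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(text_train, y_train)

predictions = pipeline.predict(text_validate)
print(len(text_train), len(text_validate))
print(list(predictions))
```

Calling `pipeline.predict` on new README text applies the already-fitted vocabulary, so words that appear only in held-out documents are ignored rather than silently informing the features.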

In [10]:
model.print_lr_tfidf_model_train(train)

Accuracy: 77.45%
---
Confusion Matrix
actual              C  C++  JavaScript  Other  PHP  Python
lr_predicted_tdidf                                        
C++                 0    1           0      0    0       0
JavaScript          0    0          23      0    0       0
Other               5    8           0     35    9       1
Python              0    0           0      0    0      20
---
              precision    recall  f1-score   support

           C       0.00      0.00      0.00         5
         C++       1.00      0.11      0.20         9
  JavaScript       1.00      1.00      1.00        23
       Other       0.60      1.00      0.75        35
         PHP       0.00      0.00      0.00         9
      Python       1.00      0.95      0.98        21

    accuracy                           0.77       102
   macro avg       0.60      0.51      0.49       102
weighted avg       0.73      0.77      0.70       102



In [11]:
model.print_lr_tfidf_model_validate(validate)

Accuracy: 34.38%
---
Confusion Matrix
actual              C  C++  JavaScript  Other  PHP  Python
lr_predicted_tdidf                                        
Other               2    3           7     11    3       6
---
              precision    recall  f1-score   support

           C       0.00      0.00      0.00         2
         C++       0.00      0.00      0.00         3
  JavaScript       0.00      0.00      0.00         7
       Other       0.34      1.00      0.51        11
         PHP       0.00      0.00      0.00         3
      Python       0.00      0.00      0.00         6

    accuracy                           0.34        32
   macro avg       0.06      0.17      0.09        32
weighted avg       0.12      0.34      0.18        32
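The numbers in this report follow directly from the confusion matrix, in which every validate row was predicted as "Other". The sketch below recomputes them by hand from the counts shown above; the variable and function names are illustrative. Precision for a class that is never predicted is treated as 0, matching the 0.00 entries in the report.

```python
# Actual class counts on the validate set, copied from the confusion matrix above.
classes = ["C", "C++", "JavaScript", "Other", "PHP", "Python"]
actual_counts = {"C": 2, "C++": 3, "JavaScript": 7, "Other": 11, "PHP": 3, "Python": 6}

# Layout: predicted label -> actual label -> count. Only "Other" was ever predicted.
confusion = {"Other": dict(actual_counts)}
total = sum(actual_counts.values())


def precision(label):
    """Correct predictions of label divided by all predictions of label (0 if none)."""
    row = confusion.get(label, {})
    predicted = sum(row.values())
    return row.get(label, 0) / predicted if predicted else 0.0


def recall(label):
    """Correct predictions of label divided by all rows whose actual class is label."""
    return confusion.get(label, {}).get(label, 0) / actual_counts[label]


accuracy = sum(confusion.get(c, {}).get(c, 0) for c in classes) / total
macro_precision = sum(precision(c) for c in classes) / len(classes)
macro_recall = sum(recall(c) for c in classes) / len(classes)

print(f"accuracy        {accuracy:.4f}")
print(f"Other precision {precision('Other'):.2f}, recall {recall('Other'):.2f}")
print(f"macro precision {macro_precision:.2f}, macro recall {macro_recall:.2f}")
```

An accuracy of 11/32 is exactly what a model earns by always guessing the majority class, so on this split the TF-IDF model matches, but does not beat, that trivial baseline.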



## Bag-of-Words Method

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [13]:
X = vectorizer.fit_transform(df.lemmatized)
y = df.language

# Same two-stage stratified split as in the TF-IDF section
X_train, X_validate, y_train, y_validate = train_test_split(X, y, stratify=y, test_size=.2)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train, test_size=.2)

In [14]:
train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))

lm = LogisticRegression().fit(X_train, y_train)

train['lr_predicted_bagofwords'] = lm.predict(X_train)
validate['lr_predicted_bagofwords'] = lm.predict(X_validate)

In [15]:
model.print_lr_bagofwords_model_train(train)

Accuracy: 98.04%
---
Confusion Matrix
actual                   C  C++  JavaScript  Other  PHP  Python
lr_predicted_bagofwords                                        
C                        5    0           0      0    0       0
C++                      0    9           0      0    0       0
JavaScript               0    0          21      0    0       0
Other                    0    0           2     35    0       0
PHP                      0    0           0      0    9       0
Python                   0    0           0      0    0      21
---
              precision    recall  f1-score   support

           C       1.00      1.00      1.00         5
         C++       1.00      1.00      1.00         9
  JavaScript       1.00      0.91      0.95        23
       Other       0.95      1.00      0.97        35
         PHP       1.00      1.00      1.00         9
      Python       1.00      1.00      1.00        21

    accuracy                           0.98       102
   macro avg

In [16]:
model.print_lr_bagofwords_model_validate(validate)

Accuracy: 43.75%
---
Confusion Matrix
actual                   C  C++  JavaScript  Other  PHP  Python
lr_predicted_bagofwords                                        
JavaScript               0    1           2      3    0       1
Other                    1    2           3      7    1       0
PHP                      0    0           1      0    0       0
Python                   1    0           1      1    2       5
---
              precision    recall  f1-score   support

           C       0.00      0.00      0.00         2
         C++       0.00      0.00      0.00         3
  JavaScript       0.29      0.29      0.29         7
       Other       0.50      0.64      0.56        11
         PHP       0.00      0.00      0.00         3
      Python       0.50      0.83      0.62         6

    accuracy                           0.44        32
   macro avg       0.21      0.29      0.25        32
weighted avg       0.33      0.44      0.37        32
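A quick way to compare the two logistic regression runs is to look at the gap between train and validate accuracy, and at how far validate accuracy sits above a majority-class baseline. The sketch below uses only the accuracies reported above. The baseline assumes always predicting "Other", which is 11 of the 32 validate rows. The dictionary layout and names are illustrative.

```python
# Reported accuracies in percent, copied from the outputs above.
reported = {
    "tfidf": {"train": 77.45, "validate": 34.38},
    "bag_of_words": {"train": 98.04, "validate": 43.75},
}

# Majority-class baseline on validate: always predict "Other" (11 of 32 rows).
baseline = 100 * 11 / 32

gaps = {}
lifts = {}
for name, acc in reported.items():
    gaps[name] = acc["train"] - acc["validate"]   # large gap suggests overfitting
    lifts[name] = acc["validate"] - baseline      # improvement over always guessing "Other"
    print(f"{name:13s} train-validate gap {gaps[name]:6.2f}  lift over baseline {lifts[name]:6.2f}")
```

Both gaps exceed 40 percentage points on these runs, and only the bag-of-words model clears the baseline. Because the splits are unseeded and small, these figures should be read as indicative rather than stable.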



-----

# Conclusion

----

### Summary

### Next Steps