# *Minecraft NLP Project*
### Presented by: Rae Downen, Cristina Lucin, Michael Mesa and John "Chris" Rosenberger

---

## Project Overview

This project builds a model that predicts the primary programming language of a GitHub repository from the text of its Readme file. We develop candidate models with Python and its libraries and select the most effective one for production. We use BeautifulSoup to acquire the data: 1,000 repositories tagged 'Minecraft' on GitHub, collecting the Readme text and repo language from each. After gathering the data, we explore it through questions and visualizations before developing a model that can tell us: "What language is this repository most likely to be written in?"

## Goals
### Create deliverables:
* README
* Final Report
* Functional acquire.py, explore.py, and model.py files
* Acquire data from GitHub utilizing BeautifulSoup to scrape targeted Repositories ('Repos')
* Prepare and split the data
* Explore the data
* Establish a baseline
* Fit and train a classification model to predict the programming language of the Repo
* Evaluate the candidate models by comparing their performance on train, using accuracy as the measure
* Evaluate the selected model on test data
* Develop and document findings, takeaways, recommendations, and next steps

In [1]:
#General DS Imports
import pandas as pd
import numpy as np

# Visualizations
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt  # plt must be pyplot, not the top-level matplotlib package
from PIL import Image

#Modeling, NLP and Exploration
from requests import get
from bs4 import BeautifulSoup
import json
from typing import Dict, List, Optional, Union, cast
import requests
import re
import time
import unicodedata
import nltk
from sklearn.model_selection import train_test_split

#My imports
import os
from env import github_token, github_username
from importlib import reload
import acquire as a
import explore as e
import prepare as p
import modeling as m


# Acquire
* 1,000 Repo URLs tagged "Minecraft" were acquired from GitHub utilizing a .py script "acquire_minecraft_urls.py"
* These Repos were identified and scraped through the search feature in GitHub
* Repo Readme text and Repo language were scraped utilizing BeautifulSoup
* Readme text and Repo language were collected into a dictionary using the functions "process_repo" and "scrape_github_data"
* This dictionary was turned into a data frame and CSV file
* The CSV file contained 1,000 rows and 3 features before cleaning
* Each row represents a unique Repo located on GitHub
* Each column represents a feature of the Repo, such as URL, Readme text, or Programming Language

# Prepare
#### Prepare Actions:
* Renamed columns to improve readability
* Removed white space from values in object columns
* Checked for null values in the data, dropped all rows where nulls existed
* Utilized Regex and string methods and functions to clean Repo Readme text
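
The Regex and string cleaning listed above could look roughly like the sketch below. It is not the project's actual prepare code: the function name `basic_clean` and the exact rules (lowercasing, ASCII normalization, stripping punctuation) are assumptions chosen for illustration.

```python
import re
import unicodedata


def basic_clean(text: str) -> str:
    """Hypothetical cleaner: lowercase, normalize to ASCII, keep letters, digits, spaces."""
    text = text.lower()
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    # Replace every character that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


sample = "# Café-Mod\n\nRun `./gradlew build` to   compile!"
cleaned = basic_clean(sample)
print(cleaned)
```

Markdown symbols, backticks, and path separators are all replaced by spaces, so only plain word tokens reach the later steps.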

In [None]:
#Import our data from a .csv file, take a peek at the data
df = pd.read_csv(r'clean_scraped_data.csv', index_col=[0])
df.head()

In [None]:
# Top 20 languages across the scraped Repos
df.language.value_counts().head(20)

### We chose to focus on the top 3 programming languages found in the scraped Repos, classifying all other languages as "Other":

In [None]:
#Recast other languages as "Other"
df = p.map_other_languages(df)
df.head()

In [None]:
df.language.value_counts()
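
One way the recasting step might work is sketched below with pandas. The helper name `map_other` and the toy data are illustrative assumptions; the project's own `p.map_other_languages` may be implemented differently.

```python
import pandas as pd


def map_other(df: pd.DataFrame, column: str = "language", top_n: int = 3) -> pd.DataFrame:
    """Hypothetical helper: keep the top_n most frequent labels, recast the rest as 'Other'."""
    top = df[column].value_counts().nlargest(top_n).index
    out = df.copy()
    # Series.where keeps values where the condition holds and substitutes elsewhere.
    out[column] = out[column].where(out[column].isin(top), "Other")
    return out


# Toy data, not the scraped dataset.
toy = pd.DataFrame(
    {"language": ["Java"] * 4 + ["JavaScript"] * 3 + ["Python"] * 2 + ["Kotlin", "C#"]}
)
mapped = map_other(toy)
print(mapped["language"].value_counts())
```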

### Cleaning: We elected to remove reserved words shared by all 3 top languages, plus some shared only by Java and JavaScript. We also removed common domain words such as 'minecraft', 'server', and 'run' by adding them as stop words in a prepare function:

In [None]:
#Remove stopwords from dataframe
df = p.prep_readme_data(df, 'readme_contents')
df.head()
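
A minimal sketch of the stop-word step follows. The small hand-written `BASE_STOPWORDS` set stands in for a fuller list (for example the one shipped with nltk), and `remove_stopwords` is a hypothetical name, not the project's `p.prep_readme_data`.

```python
# Illustrative stop-word sets; the real lists used by the project are not shown here.
BASE_STOPWORDS = {"the", "a", "to", "and", "on", "of", "in", "is"}
EXTRA_STOPWORDS = {"minecraft", "server", "run"}


def remove_stopwords(text: str, extra_words=EXTRA_STOPWORDS) -> str:
    """Drop base and domain-specific stop words from whitespace-separated text."""
    stopwords = BASE_STOPWORDS | set(extra_words)
    return " ".join(word for word in text.split() if word not in stopwords)


filtered = remove_stopwords("run the minecraft server to install the mod")
print(filtered)
```

Words like 'minecraft' appear in nearly every Readme in this dataset, so they carry little signal for telling the languages apart.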

### Train-Test Split

* For exploration, we split the data three ways: 20% for test, 30% of the remaining data for validate, and the rest for train

In [None]:
train, validate, test = e.split_minecraft_data(df)
train.head()
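
The split can be sketched with two calls to scikit-learn's `train_test_split`. This assumes the validate share is taken from the data remaining after the test split; the `random_state` value and the toy inputs are illustrative choices, and `e.split_minecraft_data` may differ (for example, it may stratify on language).

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the Readme text and language columns.
readmes = [f"readme text {i}" for i in range(100)]
languages = ["Java", "JavaScript", "Python", "Other"] * 25

# First cut: hold out 20% of all rows as the test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    readmes, languages, test_size=0.2, random_state=42
)
# Second cut: take 30% of the remaining rows as the validate set.
X_train, X_validate, y_train, y_validate = train_test_split(
    X_train_val, y_train_val, test_size=0.3, random_state=42
)

sizes = (len(X_train), len(X_validate), len(X_test))
print(sizes)
```

Under this reading, train ends up with 56% of the rows, validate with 24%, and test with 20%.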

## Question 1: What are the top programming languages found in #Minecraft GitHub Repos?

In [None]:
e.get_language_freq(train)

### Java was the most common language found in the Repos that we scraped, followed by JavaScript and Python. All other languages are included in this visualization. This makes sense, considering that Minecraft was developed using Java.

----------

## Question 2: What is the average word count of a Repo Readme file for each programming language?

In [None]:
e.get_wordcount_bar(train)
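
The quantity behind this chart, mean Readme word count per language, could be computed as in the sketch below. The function name and the toy rows are assumptions for illustration; the plotting itself is left to `e.get_wordcount_bar`.

```python
from collections import defaultdict


def mean_wordcount_by_language(rows):
    """Return {language: mean number of whitespace-separated words per Readme}."""
    totals = defaultdict(lambda: [0, 0])  # language -> [word total, Readme count]
    for language, readme in rows:
        totals[language][0] += len(readme.split())
        totals[language][1] += 1
    return {language: words / count for language, (words, count) in totals.items()}


# Toy (language, readme) pairs, not the scraped data.
rows = [
    ("Java", "mod build"),
    ("Java", "forge mod jar build"),
    ("Python", "pip install"),
]
averages = mean_wordcount_by_language(rows)
print(averages)
```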

## Question 3: What are the top 10 most frequent words found in Python Repos?

In [None]:
e.get_top10_python(train)

In [None]:
e.get_python_wordcloud()

## Question 4: What are the top 10 most frequent words found in Java Repos?

In [None]:
e.get_top10_java(train)

In [None]:
e.get_java_wordcloud()

## Question 5: What are the top 10 most frequent words found in JavaScript Repos?

In [None]:
e.get_top10_js(train)

In [None]:
e.get_js_wordcloud()

## Exploration Summary
* Java was the most frequent language found in the Repositories examined
* JavaScript Repos had the highest average wordcount, Java Repos had the lowest
* "Install" was the most common word for Python Repos
* "Mod" and "Build" were the most frequently found Java strings
* "Command" was the most frequent word found in JavaScript Repos
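
The per-language word frequencies summarized above can be sketched with `collections.Counter`. The helper name `top_words` and the toy Readmes are hypothetical; the project's `e.get_top10_*` functions are not reproduced here.

```python
from collections import Counter


def top_words(readmes, n=10):
    """Count whitespace-separated words across all Readmes and return the n most frequent."""
    counts = Counter(word for text in readmes for word in text.split())
    return counts.most_common(n)


# Toy cleaned Readmes standing in for the Python subset of train.
python_readmes = ["install pip install script", "install requirements script"]
top = top_words(python_readmes, n=2)
print(top)
```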


----

# Modeling

* We elected to utilize accuracy as the evaluation metric
* We developed three models of different types: Naive Bayes, scikit-learn Gradient Boosting, and XGBoost
* The best-performing model was evaluated on test data
* **We utilized the mode of 'language' as the baseline (Java, 45.3%)**
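
A mode baseline predicts the most common class for every observation, so its accuracy equals that class's share of the data. A minimal sketch with toy labels (not the project's data):

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Toy labels for illustration only.
labels = ["Java"] * 5 + ["JavaScript"] * 3 + ["Python"] * 2
baseline_label, baseline_acc = baseline_accuracy(labels)
print(baseline_label, baseline_acc)
```

Any candidate model has to beat this number to be worth using.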

We explored several methods of NLP modeling and elected to utilize as much useful text as possible. This is a multiclass classification project (each Repo has exactly one language label), which makes the confusion matrix more complicated than in a binary classification problem. Because of that, we decided that the full string of Readme text would be more useful for finding differences between the languages.

We trained several models on our training set without hyperparameter tuning to produce models that were 'Good Enough'. These models overfit the training data. One difficulty is that Readme files all use ordinary human language to describe a programming process. Because a Readme does not necessarily contain language-specific programming terms, classification is much more difficult.
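
As a rough illustration of this kind of untuned model, the sketch below chains scikit-learn's `CountVectorizer` with a `MultinomialNB` classifier. The toy Readmes are invented, and the project's actual features and settings inside `modeling.py` are not shown here, so treat this as one possible setup rather than the exact pipeline used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy cleaned Readmes and labels, for illustration only.
train_text = [
    "gradle build mod forge",
    "mod forge jar build",
    "pip install python script",
    "install pip requirements python",
    "npm install node bot",
    "node npm command bot",
]
train_labels = ["Java", "Java", "Python", "Python", "JavaScript", "JavaScript"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier with default settings.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_text, train_labels)

predicted = model.predict(["pip install requirements"])[0]
train_accuracy = model.score(train_text, train_labels)
print(predicted, train_accuracy)
```

Comparing `model.score` on train against the same score on validate is a quick way to see the overfitting described above.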

---

## Initial model training

---

### Naive Bayes, SKLearn Gradient Booster, and Extreme Gradient Boosting (XG Boost)

In [None]:
m.get_model_tests()

---

*All models selected overfit the training data. We elected to utilize **scikit-learn Gradient Boosting** because scikit-learn is an open-source library with a lot of support.*

---

### Testing Selected model on unseen (test) data

In [None]:
#SK Learn Gradient Boosting Test
m.gb_test()

## Modeling Summary
* All models were overfit on the training data
* scikit-learn Gradient Boosting was chosen for evaluation on test data
* **This model reached 76 percent accuracy, roughly 30 percentage points above the baseline**

---

# Takeaways/ Conclusions

- GitHub Repos with different programming languages have significantly different features (word count and unique words)
- Because Readme files are written in normal language, the accuracy of any model is limited
- Improved cleaning methods may increase model performance
- Count Vectorization (CV) in combination with ensemble classification is an effective modeling strategy for NLP/Text Classification problems

# Recommendations

- Acquire longer Readme text files to feed into the algorithm
- Narrow down parameters for classification (more languages are more difficult to classify)
- Additional hyperparameter tuning may result in better model performance

# Next Steps

* Utilize statistical methods to identify additional stop words
* Develop and test different model types for performance
* Find alternative methods for pulling repo data from GitHub