## Predicting Programming Languages
### Natural Language Processing Among GitHub Repositories
By: _AJ Martinez,        
Ben Smith,        
Nicholas Dougherty_          

In [5]:
import pandas as pd
import numpy as np
import unicodedata
import re
import nltk

# imports for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# import modules 
from prepare import * 
import acquire 
#import explore 
#import model 

# imports for NLP extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# imports for modeling
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, recall_score, plot_confusion_matrix
from sklearn.linear_model import LogisticRegression

***
## Overview and Goals

The goal of this project is to determine the main coding language of a project based on the contents of its GitHub README, using NLP methods. The data was acquired from various repositories on GitHub. To recreate this project, you will need access to the JSON file of the data we acquired. While acquiring the repo names, we filtered for the word "customer"; the term was an arbitrary choice, made only to have something to filter on.

A total of 193 repos were obtained, but after dropping nulls, dropping READMEs with Chinese characters, and narrowing the data to the most prevalent languages in our dataset, we ended up with data from 106 different documents. The four languages we kept were Java, JavaScript, PHP, and Jupyter Notebook.
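
The filtering described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `filter_repos` is hypothetical, the column names are taken from the DataFrame shown later in this notebook, and treating "Chinese characters" as the CJK Unified Ideographs range is an assumption.

```python
import pandas as pd

# Assumed stand-in for "Chinese characters": the CJK Unified Ideographs block.
# The project's actual filter is not shown in this notebook.
CJK_PATTERN = "[\u4e00-\u9fff]"


def filter_repos(df: pd.DataFrame, top_n: int = 4) -> pd.DataFrame:
    """Drop unusable rows and keep only the top_n most common languages."""
    # Drop rows missing either the label or the README text
    df = df.dropna(subset=["language", "readme_contents"])
    # Drop READMEs that contain any CJK character
    has_cjk = df["readme_contents"].str.contains(CJK_PATTERN, regex=True)
    df = df[~has_cjk]
    # Keep only the most prevalent languages in what remains
    top_languages = df["language"].value_counts().nlargest(top_n).index
    return df[df["language"].isin(top_languages)].reset_index(drop=True)
```

Calling `filter_repos(df, top_n=4)` on the raw data would keep only rows whose language is among the four most frequent ones.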

## Findings

We found that a () model using lemmatized data performed best, with an accuracy of () on the validate data set and a final test accuracy of (). This outperformed our baseline accuracy of (). Our model was (list results and whatnot.)
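
For context on the baseline, one common choice is to always predict the most frequent language and measure how often that is right. The sketch below assumes that definition; the notebook does not state how the baseline was computed, and the helper name `baseline_accuracy` and the example labels are purely illustrative, not project data.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    # most_common(1) returns a list holding one (label, count) pair
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)


# Illustrative labels only
example_labels = ["Java", "Java", "PHP", "JavaScript"]
print(baseline_accuracy(example_labels))
```

Any model worth keeping should beat this number on the same split.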

### With More Time

We'd like to acquire more data to see if we can improve the results for distinguishing among (). Our sample size was fairly small during this project.
*** 
## Acquisition and Preparation

Data was obtained through functions that scraped repository collections on GitHub. First, we manually explored GitHub in Chrome to inspect the HTML elements. The requests module then fetched the HTML of the trending portal, and BeautifulSoup parsed it into a list of Uniform Resource Locator (URL) endpoints, each of which was appended to the origin. We scripted the process of requesting those pages as well, obtaining README data from over a hundred repositories.
Here is a segment of the code used; the full code can be viewed in the acquire.py script elsewhere in our repository.

```
from requests import get
from bs4 import BeautifulSoup

# create an empty list to store endpoints
endpoints = []
# go to each url - trending repos daily, weekly, and monthly
for url in ['https://github.com/trending?since=daily&spoken_language_code=en',
            'https://github.com/trending?since=weekly&spoken_language_code=en',
            'https://github.com/trending?since=monthly&spoken_language_code=en']:
    # get the response
    response = get(url)
    # create the BeautifulSoup object; it builds a parse tree from the page source
    soup = BeautifulSoup(response.text, 'html.parser')
    # identify html objects containing each repository
    for repo in soup.select('.Box-row'):
        # pull out the url endpoint for that repo and append to the list
        endpoints.append(repo
                         .select_one('h1')
                         .select_one('a')
                         .attrs['href'])
```
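
The endpoints collected this way are relative paths such as `/google/googletest`, so they need to be joined to the site origin before they can be requested. A minimal sketch of that step, using only the standard library, might look like this; the helper name `build_repo_urls` is hypothetical and is not taken from acquire.py.

```python
from urllib.parse import urljoin

# The site origin that relative endpoints are appended to
ORIGIN = "https://github.com"


def build_repo_urls(endpoints):
    """Turn relative endpoints such as '/owner/repo' into absolute URLs."""
    return [urljoin(ORIGIN, endpoint) for endpoint in endpoints]
```

Using `urljoin` rather than plain string concatenation avoids doubled or missing slashes between the origin and the endpoint.
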
The collected data was stored in a .json file, which was then read into our DataFrame, like so:

In [4]:
# Read in data from the JSON created through acquire
df = pd.read_json('data1.json')
# View the content of the first row 
df.head(1)

Unnamed: 0,repo,language,readme_contents
0,google/googletest,C++,"# GoogleTest\n\n### Announcements\n\n#### Live at Head\n\nGoogleTest now follows the\n[Abseil Live at Head philosophy](https://abseil.io/about/philosophy#upgrade-support).\nWe recommend\n[updating to the latest commit in the `main` branch as often as possible](https://github.com/abseil/abseil-cpp/blob/master/FAQ.md#what-is-live-at-head-and-how-do-i-do-it).\n\n#### Documentation Updates\n\nOur documentation is now live on GitHub Pages at\nhttps://google.github.io/googletest/. We recommend browsing the documentation on\nGitHub Pages rather than directly in the repository.\n\n#### Release 1.11.0\n\n[Release 1.11.0](https://github.com/google/googletest/releases/tag/release-1.11.0)\nis now available.\n\n#### Coming Soon\n\n* We are planning to take a dependency on\n [Abseil](https://github.com/abseil/abseil-cpp).\n* More documentation improvements are planned.\n\n## Welcome to **GoogleTest**, Google's C++ test framework!\n\nThis repository is a merger of the formerly separate GoogleTest and GoogleMock\nprojects. These were so closely related that it makes sense to maintain and\nrelease them together.\n\n### Getting Started\n\nSee the [GoogleTest User's Guide](https://google.github.io/googletest/) for\ndocumentation. 
We recommend starting with the\n[GoogleTest Primer](https://google.github.io/googletest/primer.html).\n\nMore information about building GoogleTest can be found at\n[googletest/README.md](googletest/README.md).\n\n## Features\n\n* An [xUnit](https://en.wikipedia.org/wiki/XUnit) test framework.\n* Test discovery.\n* A rich set of assertions.\n* User-defined assertions.\n* Death tests.\n* Fatal and non-fatal failures.\n* Value-parameterized tests.\n* Type-parameterized tests.\n* Various options for running the tests.\n* XML test report generation.\n\n## Supported Platforms\n\nGoogleTest requires a codebase and compiler compliant with the C++11 standard or\nnewer.\n\nThe GoogleTest code is officially supported on the following platforms.\nOperating systems or tools not listed below are community-supported. For\ncommunity-supported platforms, patches that do not complicate the code may be\nconsidered.\n\nIf you notice any problems on your platform, please file an issue on the\n[GoogleTest GitHub Issue Tracker](https://github.com/google/googletest/issues).\nPull requests containing fixes are welcome!\n\n### Operating Systems\n\n* Linux\n* macOS\n* Windows\n\n### Compilers\n\n* gcc 5.0+\n* clang 5.0+\n* MSVC 2015+\n\n**macOS users:** Xcode 9.3+ provides clang 5.0+.\n\n### Build Systems\n\n* [Bazel](https://bazel.build/)\n* [CMake](https://cmake.org/)\n\n**Note:** Bazel is the build system used by the team internally and in tests.\nCMake is supported on a best-effort basis and by the community.\n\n## Who Is Using GoogleTest?\n\nIn addition to many internal projects at Google, GoogleTest is also used by the\nfollowing notable projects:\n\n* The [Chromium projects](http://www.chromium.org/) (behind the Chrome browser\n and Chrome OS).\n* The [LLVM](http://llvm.org/) compiler.\n* [Protocol Buffers](https://github.com/google/protobuf), Google's data\n interchange format.\n* The [OpenCV](http://opencv.org/) computer vision library.\n\n## Related Open Source Projects\n\n[GTest 
Runner](https://github.com/nholthaus/gtest-runner) is a Qt5 based\nautomated test-runner and Graphical User Interface with powerful features for\nWindows and Linux platforms.\n\n[GoogleTest UI](https://github.com/ospector/gtest-gbar) is a test runner that\nruns your test binary, allows you to track its progress via a progress bar, and\ndisplays a list of test failures. Clicking on one shows failure text. GoogleTest\nUI is written in C#.\n\n[GTest TAP Listener](https://github.com/kinow/gtest-tap-listener) is an event\nlistener for GoogleTest that implements the\n[TAP protocol](https://en.wikipedia.org/wiki/Test_Anything_Protocol) for test\nresult output. If your test runner understands TAP, you may find it useful.\n\n[gtest-parallel](https://github.com/google/gtest-parallel) is a test runner that\nruns tests from your binary in parallel to provide significant speed-up.\n\n[GoogleTest Adapter](https://marketplace.visualstudio.com/items?itemName=DavidSchuldenfrei.gtest-adapter)\nis a VS Code extension allowing to view GoogleTest in a tree view and run/debug\nyour tests.\n\n[C++ TestMate](https://github.com/matepek/vscode-catch2-test-adapter) is a VS\nCode extension allowing to view GoogleTest in a tree view and run/debug your\ntests.\n\n[Cornichon](https://pypi.org/project/cornichon/) is a small Gherkin DSL parser\nthat generates stub code for GoogleTest.\n\n## Contributing Changes\n\nPlease read\n[`CONTRIBUTING.md`](https://github.com/google/googletest/blob/master/CONTRIBUTING.md)\nfor details on how to contribute to this project.\n\nHappy testing!\n"


From here, we break our data down into smaller components using parsing tools from the NLTK library.
- All text was converted to lowercase for consistency
- Removed:
    - accented, non-ASCII characters
    - special characters
    - stopwords
- Words were stemmed and lemmatized as well. 

All of these processes were combined into a single function:

In [6]:
# prepare the dataframe and return text stemmed, lemmatized, cleaned, tokenized, et cetera
df = prep_repos(df)
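
The body of `prep_repos` lives in prepare.py and is not reproduced here. As a rough sketch of the cleaning steps listed above, and only under stated assumptions, the first few stages could look like the following. The function names and the tiny stopword set are hypothetical stand-ins; the project itself relies on NLTK, whose stopword list, stemmer, and lemmatizer are deliberately left out so the example runs with the standard library alone.

```python
import re
import unicodedata

# Tiny illustrative stopword set; a hypothetical stand-in for NLTK's English list
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for"}


def basic_clean(text):
    """Lowercase, strip accents and non-ASCII characters, drop special characters."""
    text = text.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors ignored then discards the marks and any
    # other non-ASCII character
    text = (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8"))
    # Keep only lowercase letters, digits, apostrophes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", "", text)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Remove stopwords from whitespace-separated text."""
    return " ".join(word for word in text.split() if word not in stopwords)


print(remove_stopwords(basic_clean("Café: The BEST!")))
```

Stemming and lemmatization would follow as further passes over the remaining words.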

## Exploratory Data Analysis

## Modeling