# NLP Project 

___

By: Alejandro Garcia, Forest Hensley, and Tarrence Nichols
<br>
Date: May 17, 2022

___

## Executive Summary

### Big Idea

Determine the primary programming language of a GitHub repository by applying natural language processing (NLP) techniques to its `README.md`.

### Goals

1. Predict the programming language of a repo by using NLP on the `README.md`.
2. Conclude if there is a statistically significant difference between `README.md` lengths from the top 3 most common languages.

### Key Findings


### Recommendations



___

## Project Description

In this project, we will attempt to use data from a `README.md` to predict what language a GitHub repo is primarily coded in.

The following outlines the process taken through the Data Science Pipeline to complete this project.

Plan &#8594; Acquire &#8594; Prepare &#8594; Explore &#8594; Model &#8594; Deliver

___

## Importing the Required Modules

Everything needed to run the code blocks in this notebook is located in the top-level directory. To run them, you will need Python, NumPy, pandas, Matplotlib, seaborn, NLTK, and scikit-learn installed on your computer.


```python
# shared imports live in imports.py
from imports import *

# render plots inline in the notebook
%matplotlib inline
# plotting defaults
plt.rc('figure', figsize=(16, 9))
plt.style.use('seaborn-darkgrid')
plt.rc('font', size=16)
# plt.style.available  # shows the available styles

# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# custom modules
from acquire import *
from prepare import *
```


___

## Data Acquisition and Preparation

We start by searching GitHub for repos related to the search term "bitcoin". The search is done through GitHub's API, and from the results we extract a list of URL paths for 100 related repos. We use that list to retrieve the contents of each repo's `README.md`, along with the repo's path and primary language.
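The search and retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `acquire.py`: the helper names (`search_url`, `get_json`, `parse_search`, `decode_readme`, `fetch_readme`) and the `User-Agent` string are assumptions made for the example. It relies on GitHub's REST endpoints `/search/repositories` and `/repos/{owner}/{repo}/readme`, the latter of which returns the README as base64-encoded `content`.

```python
import base64
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def search_url(term, per_page=100):
    """Build the repository-search URL for a search term."""
    query = urllib.parse.urlencode({"q": term, "per_page": per_page})
    return f"{API_ROOT}/search/repositories?{query}"


def get_json(url, token=None):
    """GET a GitHub API URL and decode the JSON body (performs a network call)."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "readme-nlp-sketch",  # hypothetical client name
    }
    if token:  # a personal access token raises the rate limit
        headers["Authorization"] = f"token {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_search(payload):
    """Keep only the repo path and primary language from a search response."""
    return [
        {"repo": item["full_name"], "language": item["language"]}
        for item in payload["items"]
    ]


def decode_readme(payload):
    """Decode the base64 README body returned by the readme endpoint."""
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


def fetch_readme(repo, token=None):
    """Download and decode the README for a repo path such as 'bitcoin/bitcoin'."""
    return decode_readme(get_json(f"{API_ROOT}/repos/{repo}/readme", token))
```

Calling `parse_search(get_json(search_url("bitcoin")))` and then `fetch_readme` on each `repo` value would yield records with the same three fields the report collects: path, language, and README text.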

The next challenge is representing English-language text numerically, which is the core task of NLP. We apply common parsing techniques to the corpus collected from GitHub, treating the contents of each `README.md` as one document. Basic cleaning lowercases each document, removes punctuation, tokenizes it, and removes stop words. Further preprocessing adds stemmed and lemmatized variants. Column names are changed for convenience, and all languages other than the top 3 are consolidated into the category 'other'. The tidied strings are returned in a single pandas DataFrame.
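A rough sketch of the basic cleaning and the language consolidation is shown below. It is not the code in `prepare.py`: the function names and the tiny `STOP_WORDS` set are assumptions for illustration, and the project presumably uses a fuller stop-word list (for example NLTK's). Stemming and lemmatizing are omitted to keep the sketch dependency-free; in NLTK they are typically done with `PorterStemmer` and `WordNetLemmatizer`.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stop-word list (assumption, not the project's actual list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "should"}


def basic_clean(text):
    """Lowercase, drop non-ASCII characters, and remove punctuation."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def tokenize(text):
    """Split on whitespace; a simple stand-in for a real tokenizer."""
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


def clean_document(text):
    """Run the basic cleaning steps and return a single tidy string."""
    return " ".join(remove_stop_words(tokenize(basic_clean(text))))


def keep_top_languages(languages, n=3, other="other"):
    """Relabel every language outside the n most common ones as `other`."""
    counts = Counter(lang for lang in languages if lang)
    top = {lang for lang, _ in counts.most_common(n)}
    return [lang if lang in top else other for lang in languages]
```

For example, `clean_document("Bitcoin Core integration/staging tree")` strips the slash and joins the two words around it, which is why tokens such as `integrationstaging` appear in the `clean` column of the output below.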


```python
# For demonstration purposes only, the `data.json` file is being pulled from cache.
# For the initial run on a new machine, please run `python acquire.py` in the terminal prior to running this notebook.

# Results from `acquire.py`, loaded as a pandas DataFrame.
df = pd.read_json('/Users/hinzlehome/codeup-data-science/Garcia-Hensley-Nichols-NLP-project/data.json')

# `README.md` contents from above are tidied and returned with stemmed and lemmatized variants included.
df = words(df)

df.head()
```

| | repo | language | readme | clean | stemmed | lemmatized | contains_python_keywords | contains_cpp_keywords | contains_js_keywords |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bitcoin/bitcoin | C++ | `Bitcoin Core integration/staging tree\n=======...` | bitcoin core integrationstaging tree httpsbitc... | bitcoin core integrationstag tree httpsbitcoin... | bitcoin core integrationstaging tree httpsbitc... | 1 | 1 | 0 |
| 1 | bitcoinbook/bitcoinbook | other | `Code Examples: ![travis_ci](https://travis-ci....` | code examples traviscihttpstravisciorgbitcoinb... | code exampl traviscihttpstravisciorgbitcoinboo... | code example traviscihttpstravisciorgbitcoinbo... | 0 | 0 | 0 |
| 2 | bitcoin/bips | other | `People wishing to submit BIPs, first should pr...` | people wishing submit bips first propose idea ... | peopl wish submit bip first propos idea docume... | people wishing submit bips first propose idea ... | 1 | 1 | 0 |
| 3 | bitcoinjs/bitcoinjs-lib | other | `# BitcoinJS (bitcoinjs-lib)\n[![Github CI](htt...` | bitcoinjs bitcoinjslib github cihttpsgithubcom... | bitcoinj bitcoinjslib github cihttpsgithubcomb... | bitcoinjs bitcoinjslib github cihttpsgithubcom... | 1 | 0 | 1 |
| 4 | spesmilo/electrum | Python | `Electrum - Lightweight Bitcoin client\n=======...` | electrum lightweight bitcoin client licence mi... | electrum lightweight bitcoin client licenc mit... | electrum lightweight bitcoin client licence mi... | 1 | 1 | 0 |



___

## Exploratory Analysis


In the visualizations below, we aim to answer some questions about the data. Details about how these visualizations are created can be found in the `explore.py` file.

### Question 1

1. Can we predict the programming language of a repo by using NLP on the `README.md`?

### Question 2

2. Is there a statistically significant difference between `README.md` lengths from the top 3 most common languages?
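The report does not state which statistical test is used for this question. One option, sketched below under that caveat, is a permutation test on the one-way ANOVA F-statistic: shuffle the README lengths across the three language groups many times and count how often the shuffled F-statistic is at least as large as the observed one. The function names and the example lengths are hypothetical; library alternatives include `scipy.stats.f_oneway` and `scipy.stats.kruskal`.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group variance over within-group variance."""
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    group_means = [mean(group) for group in groups]
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_test(groups, n_permutations=2000, seed=0):
    """Return the observed F and a permutation p-value for equal group means."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for group in groups for v in group]
    sizes = [len(group) for group in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:  # re-cut the shuffled pool into groups of the original sizes
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # add-one correction so the p-value is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical README lengths (in characters) for three languages; not project data.
example_lengths = [
    [1200, 1500, 900, 2000, 1700],
    [800, 950, 700, 1100, 600],
    [3000, 2800, 3500, 2600, 3100],
]
f_value, p_value = permutation_test(example_lengths)
```

A `p_value` below the chosen significance level (commonly 0.05) would suggest that mean `README.md` length differs between at least two of the languages.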

### Question 3

### Question 4

### Key Takeaways



___
## Modeling





```python
# return train, validate, test splits for supervised machine learning
# the target (y) is isolated from the features (X)
# a modeling set is returned for each of the cleaned, stemmed, and lemmatized corpora
(X_lem_train, X_lem_validate, X_lem_test,
 y_lem_train, y_lem_validate, y_lem_test,
 X_stem_train, X_stem_validate, X_stem_test,
 y_stem_train, y_stem_validate, y_stem_test,
 X_clean_train, X_clean_validate, X_clean_test,
 y_clean_train, y_clean_validate, y_clean_test) = model_prep(df)
```

```python
print(X_lem_train.head(), '\n')
print(y_lem_train.head())
```

```text
                                           lemmatized
21  originalbitcoin historical repository satoshi ...
99  bitcoin hardware wallet interface build status...
71  exchange software used intersangocom britcoinc...
97  ruimarinhobitcoincore bitcoincore docker image...
35  zeronet build statushttpstravisciorghellozeron...

21           C++
99        Python
71         other
97         other
35    JavaScript
Name: language, dtype: object
```
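The internals of `model_prep` are not shown in this report. As a rough idea of what a train, validate, and test split involves, the sketch below splits row positions so that each language keeps about the same share in every subset (a stratified split). The function name, the fractions, and the seed are assumptions for illustration; in practice scikit-learn's `train_test_split` with its `stratify` argument is the usual tool.

```python
import random
from collections import defaultdict


def stratified_split(labels, validate_frac=0.2, test_frac=0.2, seed=123):
    """Return (train, validate, test) lists of row positions, stratified by label."""
    rng = random.Random(seed)
    positions_by_label = defaultdict(list)
    for position, label in enumerate(labels):
        positions_by_label[label].append(position)

    train, validate, test = [], [], []
    for positions in positions_by_label.values():
        rng.shuffle(positions)  # randomize within each language
        n_test = round(len(positions) * test_frac)
        n_validate = round(len(positions) * validate_frac)
        test.extend(positions[:n_test])
        validate.extend(positions[n_test:n_test + n_validate])
        train.extend(positions[n_test + n_validate:])
    return sorted(train), sorted(validate), sorted(test)
```

With a DataFrame, the returned positions could be passed to `df.iloc[...]` to select the feature column (for example `lemmatized`) as `X` and the `language` column as `y` for each subset.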



___

## Key Takeaways and Recommendations



___

## Next Steps

