# Natural Language Processing Project

>Goals:
- Build a dataset of README text from 100 GitHub repositories
- Explore the README text and find connections to programming language
- Build a classification ML model that predicts the programming language used in a repo based on README content.

In [1]:
import numpy as np
import pandas as pd
import json
import re
import warnings
warnings.filterwarnings("ignore")

from prepare import add_columns, split_repo_data

---
## Acquire

In [2]:
# raw data
df = pd.read_json('repos.json')

In [3]:
# summary of data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 599 entries, 0 to 598
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   language  599 non-null    object
 1   repo      599 non-null    object
 2   content   599 non-null    object
dtypes: object(3)
memory usage: 18.7+ KB


In [4]:
df.head()

|   | language | repo | content |
|---|----------|------|---------|
| 0 | JavaScript | facebook/react | React · \nReact is a JavaScript library for... |
| 1 | JavaScript | d3/d3 | D3: Data-Driven Documents\n\nD3 (or D3.js) is ... |
| 2 | JavaScript | vuejs/vue | \n\n\n\n\n\n\n\n\n\n\nSupporting Vue.js\nVue.j... |
| 3 | JavaScript | axios/axios | axios\n\n\n\n\n\n\n\n\nPromise based HTTP clie... |
| 4 | JavaScript | facebook/create-react-app | Create React App \n\nCreate React apps with n... |


In [5]:
# how many of each language
df.language.value_counts()

JavaScript    300
Python        299
Name: language, dtype: int64

In [6]:
# number of unique repos
df.repo.nunique()

581

<div class="alert alert-block alert-info">
<b>Summary</b>:
    <li>Data was acquired by scraping GitHub with the BeautifulSoup library.</li>
    <li>Helper functions sent GET requests to the first 30 search-result pages of the most-starred repos for each of JavaScript and Python.</li>
    <li>A helper function parsed the HTML for the elements containing the <i>programming language</i>, <i>repo sub-URL</i>, and <i>README content</i> of each repo on those pages and saved them to a DataFrame, which is stored locally as a JSON file for reproducibility.</li>
</div>
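
The scraping helpers themselves are not shown in this notebook. As a rough, hypothetical sketch of the first step described above, the search-page URLs for both languages could be generated as follows. The endpoint, the query parameters (`q`, `l`, `s`, `o`, `type`, `p`), and the function name `search_page_urls` are assumptions for illustration and should be checked against GitHub's current search URLs.

```python
from urllib.parse import urlencode

BASE_URL = "https://github.com/search"  # assumed search endpoint
LANGUAGES = ["JavaScript", "Python"]
N_PAGES = 30  # first 30 result pages per language, as described in the summary


def search_page_urls(languages=LANGUAGES, n_pages=N_PAGES):
    """Return one search URL per (language, page), ordered by stars.

    The parameter names below are assumptions, not taken from this project.
    """
    urls = []
    for language in languages:
        for page in range(1, n_pages + 1):
            params = {
                "q": "stars:>0",       # any repo with at least one star
                "l": language,         # language filter
                "s": "stars",          # sort field
                "o": "desc",           # most starred first
                "type": "Repositories",
                "p": page,             # result page number
            }
            urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


urls = search_page_urls()
print(len(urls))
print(urls[0])
```

Each URL would then be requested and the response parsed with BeautifulSoup to pull out the language, repo path, and README text.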

---
## Prepare

In [7]:
# prepared data
df = pd.read_json('repos_clean.json').reset_index(drop=True)

In [8]:
# adds list of words column and length of cleaned doc
df = add_columns(df)
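
`add_columns` is imported from `prepare` and its source is not shown here. A minimal sketch of what such a helper might look like, assuming the `clean` column holds space-separated words and that `doc_length` counts words rather than characters (both are assumptions; `add_columns_sketch` is a hypothetical name):

```python
import pandas as pd


def add_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for prepare.add_columns: adds `words` and `doc_length`."""
    out = df.copy()
    # Split each cleaned document on whitespace to get its list of words.
    out["words"] = out["clean"].str.split()
    # Assumption: document length is measured in words, not characters.
    out["doc_length"] = out["words"].apply(len)
    return out


example = pd.DataFrame(
    {"clean": ["welcome streamlit fastest way", "plotlypy latest release"]}
)
print(add_columns_sketch(example))
```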

In [9]:
# prepped data summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581 entries, 0 to 580
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   language    581 non-null    object
 1   repo        581 non-null    object
 2   content     581 non-null    object
 3   stemmed     581 non-null    object
 4   lemmatized  581 non-null    object
 5   clean       581 non-null    object
 6   words       581 non-null    object
 7   doc_length  581 non-null    int64 
dtypes: int64(1), object(7)
memory usage: 36.4+ KB


In [12]:
# split the data
train, validate, test = split_repo_data(df)

In [15]:
print(train.language.value_counts(), '\n')
print(validate.language.value_counts(), '\n')
print(test.language.value_counts())
train.head(3)

JavaScript    180
Python        168
Name: language, dtype: int64 

JavaScript    60
Python        56
Name: language, dtype: int64 

JavaScript    60
Python        57
Name: language, dtype: int64


|     | language | clean | words | doc_length |
|-----|----------|-------|-------|------------|
| 277 | Python | welcome streamlit fastest way build share data... | [welcome, streamlit, fastest, way, build, shar... | 202 |
| 97 | JavaScript | translation espaol deutsch portugus trke add f... | [translation, espaol, deutsch, portugus, trke,... | 633 |
| 501 | Python | plotlypy latest release user forum pypi downlo... | [plotlypy, latest, release, user, forum, pypi,... | 399 |
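
The counts above (348 / 116 / 117 rows out of 581) are consistent with a roughly 60/20/20 split stratified on `language`. The implementation of `split_repo_data` is not shown, so the following is only a standard-library sketch of the idea; the function name `stratified_split`, the fractions, and the seed are assumptions, and the real helper may round differently or use a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, fractions=(0.6, 0.2), seed=123):
    """Split rows into train/validate/test, keeping each label's share similar.

    `fractions` gives the train and validate shares; the remainder goes to test.
    """
    rng = random.Random(seed)

    # Group rows by label so each group is split independently.
    by_label = defaultdict(list)
    for row in rows:
        by_label[key(row)].append(row)

    train, validate, test = [], [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_train = round(len(label_rows) * fractions[0])
        n_validate = round(len(label_rows) * fractions[1])
        train += label_rows[:n_train]
        validate += label_rows[n_train:n_train + n_validate]
        test += label_rows[n_train + n_validate:]
    return train, validate, test


# Toy data: (language, repo id) pairs.
rows = [("JavaScript", i) for i in range(10)] + [("Python", i) for i in range(5)]
train, validate, test = stratified_split(rows, key=lambda row: row[0])
print(len(train), len(validate), len(test))
```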


<div class="alert alert-block alert-info">
<b>Summary</b>:
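    <li>README content is normalized, tokenized, stemmed, lemmatized, and stripped of stopwords to produce the "clean" content.</li>
    <li>Duplicate repos are removed (599 rows down to 581), and two columns are added: <code>words</code> (the list of words in the cleaned document) and <code>doc_length</code> (the length of the cleaned document).</li>
    <li>The data is split into train, validate, and test sets, stratified on programming language.</li>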
</div>
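
The cleaning code lives outside this notebook. As a simplified illustration of the normalize, tokenize, and stopword-removal steps listed above, here is a standard-library sketch. It deliberately leaves out stemming and lemmatization, which would normally come from an NLP library, and the stopword list and function names are placeholders rather than the project's own.

```python
import re
import unicodedata

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "for", "in", "with"}


def basic_clean(text: str) -> str:
    """Lowercase, drop non-ASCII characters, and keep only letters, digits, and whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9\s]", "", text)


def tokenize(text: str) -> list:
    """Split on whitespace; enough for text that has already been normalized."""
    return text.split()


def remove_stopwords(tokens: list) -> list:
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def clean_readme(text: str) -> str:
    """Run the three steps and return a space-joined 'clean' document."""
    return " ".join(remove_stopwords(tokenize(basic_clean(text))))


print(clean_readme("React is a JavaScript library for building user interfaces."))
```

Because punctuation is removed rather than replaced with a space, a name like `plotly.py` collapses to `plotlypy` under this sketch.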

---
## Explore

> **What's the proportion of each language in our data?**

> **What are the most common words in READMEs?**

> **Does the length of the README vary by programming language?**

> **Do different programming languages use a different number of unique words?**