# Natural Language Processing Project<a class="anchor" id="top"></a>
## Team Members
[Bethany Thompson](https://github.com/ThompsonBethany01) & [Bibek Mainali](https://github.com/MainaliB)
## Goals 
Predict a repository's primary coding language from its README file.
- Acquire data on GitHub's trending repositories
- Clean data by normalizing any text
- Explore trends in text within each coding language
- Create a classification model to predict the coding language

## Conclusions
- Trends:
- Model Metrics:

## Reproduction Requirements
### Files
In your working directory, download:
- Data_Analysis.ipynb
- Acquire.py
- Prepare.py  

Tools:
- Python Version
- Pandas Version
- Other Versions

## Table of Contents
1. [Acquisition](#first-bullet)
2. [Preparation](#second-bullet)
3. [Exploration](#third-bullet)
4. [Modeling](#fourth-bullet)
5. [Final Conclusions](#fifth-bullet)

# Acquisition <a class="anchor" id="first-bullet"></a>
For this project, we had to build the dataset ourselves. We chose a list of GitHub repositories to scrape and wrote the Python code needed to extract each repository's README text and primary language.

To find the language of a repository:
1. Visit the main page of the repo
2. Locate the **Languages** section at the bottom right of the page
3. Find the HTML element ```<ul class="list-style-none">```
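
As a rough sketch of the last step, the snippet below pulls the first language out of a `list-style-none` list with BeautifulSoup. The `sample_html` string is a simplified, hypothetical stand-in for a real repo page (GitHub's actual markup is more complex and may change), and we assume the first list item is the primary language.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; a real GitHub page carries more classes and nesting.
sample_html = """
<h2>Languages</h2>
<ul class="list-style-none">
  <li><span>Python</span> <span>98.9%</span></li>
  <li><span>Makefile</span> <span>1.1%</span></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Locate the list that holds the language breakdown.
language_list = soup.find("ul", class_="list-style-none")

# Assumption: languages are listed largest first, so the first item is the primary one.
first_item = language_list.find("li")
primary_language = first_item.find("span").get_text(strip=True)

print(primary_language)
```

On a live page the same class may appear on other lists, so a real scraper would probably need a more specific selector.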

The only requirement is to include at least 100 repositories in our data set.

## Repositories Chosen
- GitHub's Trending English Repositories - At Least 25 from Top 4 Most Popular Coding Languages
     - Python
     - Java
     - Swift
     - JavaScript


In [2]:
# Acquire Imports
import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import os

# Acquire.py Module
import Acquire

## Acquire.get_top_repo Function
- Scrapes repository names from the trending GitHub repo page, acquiring 25 from each coding language filter of Python, Java, JavaScript, and Swift
- Creates the url for each repo from its name in the form user/repo_name
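
The url construction can be pictured with the small sketch below. The helper names `trending_url` and `repo_url` are hypothetical and are not taken from `Acquire.py`; the trending url pattern is our assumption about what the function requests.

```python
def trending_url(language: str, since: str = "daily") -> str:
    """Build the trending-page url for one language filter (assumed pattern)."""
    return f"https://github.com/trending/{language}?since={since}"


def repo_url(full_name: str) -> str:
    """Turn a 'user/repo_name' string into the repo's main page url."""
    return f"https://github.com/{full_name.strip('/')}"


print(trending_url("python"))
print(repo_url("user/repo_name"))
```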

In [None]:
# urls = Acquire.get_top_repo(['python','java','javascript','swift'],'daily')

In [None]:
# urls.head()

In [None]:
# no repeat urls present
# urls.link.value_counts()[urls.link.value_counts() > 1]

In [None]:
# expected amount of coding languages
# urls.language.value_counts()

## Acquire.get_content_df Function

In [None]:
# code ran once for acquire and prep, final df saved to csv after prepare
# df = Acquire.get_content_df(urls['link'])

In [None]:
# df

### Takeaways
Our df includes:
- content as Readme file text
- watchers as number of users watching the repo
- stars as number of users that have starred the repo
- forks as number of users that have forked the repo

Next steps:
1. clean the text file
2. convert counts from strings to integers, e.g. 1.5k to 1500
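
A minimal sketch of the count conversion, assuming the scraped values look like `1.5k`, `842`, or `2,300`. The function name `count_to_int` is ours for illustration and may not match what `Prepare.py` does.

```python
def count_to_int(text: str) -> int:
    """Convert a count string such as '1.5k' or '2,300' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}

    # A trailing 'k' or 'm' scales the number in front of it.
    if cleaned and cleaned[-1] in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))

    return int(float(cleaned))


print([count_to_int(s) for s in ["1.5k", "842", "23.1k", "2,300"]])
```

With pandas, the same function could be applied column by column, for example `df['stars'].apply(count_to_int)`.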

##### [Back to Top](#top)

# Preparation <a class="anchor" id="second-bullet"></a>
Within the Prepare.py module:
- readme file text is normalized using Natural Language Processing
- string numbers are converted to integers using pandas
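
To give a feel for the normalization step, here is a small sketch using only the standard library. The function names and the tiny stopword list are illustrative assumptions, not the contents of `Prepare.py`. Punctuation is replaced with a space rather than deleted, so that text like `YouTube.com` does not fuse into a single token.

```python
import re
import unicodedata

# Illustrative stopword list only; a real run would likely use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "from", "or", "to", "for"}


def basic_clean(text: str) -> str:
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text: str) -> str:
    """Drop words that appear in the stopword list."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)


readme = "Download vidéos from YouTube.com, or other platforms!"
cleaned = basic_clean(readme)
filtered = remove_stopwords(cleaned)
print(cleaned)
print(filtered)
```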

In [3]:
import Prepare

In [None]:
# df = Prepare.prepare_df(df)

In [None]:
# df = df.merge(urls, left_on=df.index, right_on=urls.index).drop('key_0', axis=1)

In [None]:
# will work with this df from now up to testing the final model chosen
# will generate new data later to evaluate the final model on test
# df.to_csv('train_validate.csv')

In [4]:
df = pd.read_csv('train_validate.csv', index_col=0)

In [None]:
df.head()

### Takeaways
Conclusions:  
Next Steps:

##### [Back to Top](#top)

# Exploration <a class="anchor" id="third-bullet"></a>
### Before splitting the df, we can do univariate exploration:
   - distributions of single variables
   - determine if outliers are present - are they okay in the context or need to be removed?  


### Split the data into train and validate for bivariate analysis
   - What are the most common words in READMEs?
   - What does the distribution of IDFs look like for the most common words?
   - Does the length of the README vary by programming language?
   - Do different programming languages use a different number of unique words?
   - What words are present only within the specific coding languages?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('figure', figsize = [13,9])
from wordcloud import WordCloud

import nltk
import re

## Univariate Analysis Before Splitting the DF

In [None]:
df.describe()

In [None]:
x = 1
plt.figure(figsize=(13,9))
for col in df.describe():
    
    plt.subplot(3,2,x)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count of Repos')
    plt.hist(df[col])
    x = x + 1
    
plt.tight_layout()

## Splitting the DF Into Train and Validate for Bivariate Analysis and Modeling
- Prepare Function Splits DF Into 68% Train, 32% Validate
    - roughly 17 observations from each language for Train
    - roughly 8 observations from each language for Validate
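
If an exact per-language balance is wanted, one option is a stratified split. The sketch below uses scikit-learn's `train_test_split` on a toy frame with 25 rows per language. It is not the code inside `Prepare.train_validate`, and the `random_state` value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real df: 25 repos for each of the 4 languages.
languages = ["python", "java", "javascript", "swift"]
toy = pd.DataFrame({
    "language": [lang for lang in languages for _ in range(25)],
    "word_length": range(100),
})

# stratify keeps the language proportions the same in both pieces.
train_s, validate_s = train_test_split(
    toy, test_size=0.32, stratify=toy["language"], random_state=123
)

print(train_s["language"].value_counts().to_dict())
print(validate_s["language"].value_counts().to_dict())
```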

In [5]:
train, validate = Prepare.train_validate(df)


train percent:  68.0 , validate percent:  32.0


In [6]:
train.language.value_counts()

javascript    18
java          18
python        16
swift         16
Name: language, dtype: int64

## Exploring Train DF Only
### Plotting Word Probability by Language

In [None]:
python_words = ' '.join(train[train.language == 'python'].filtered)
java_words = ' '.join(train[train.language == 'java'].filtered)
javascript_words = ' '.join(train[train.language == 'javascript'].filtered)
swift_words = ' '.join(train[train.language == 'swift'].filtered)
all_words = ' '.join(train.filtered)

In [None]:
# drop single-character tokens; replace with a space so neighboring words are not fused together
python_words = re.sub(r'\s.\s', ' ', python_words)
java_words = re.sub(r'\s.\s', ' ', java_words)
javascript_words = re.sub(r'\s.\s', ' ', javascript_words)
swift_words = re.sub(r'\s.\s', ' ', swift_words)
all_words = re.sub(r'\s.\s', ' ', all_words)

In [None]:
python_words_freq = pd.Series(python_words.split()).value_counts()
java_words_freq = pd.Series(java_words.split()).value_counts()
javascript_words_freq = pd.Series(javascript_words.split()).value_counts()
swift_words_freq = pd.Series(swift_words.split()).value_counts()
all_words_freq = pd.Series(all_words.split()).value_counts()

In [None]:
# combine the per-language counts into one table; set_axis returns a new frame by default
word_count = (pd.concat([all_words_freq, python_words_freq, java_words_freq, javascript_words_freq, swift_words_freq], axis=1, sort=True)
              .set_axis(['all','python', 'java', 'javascript', 'swift'], axis=1)
              .fillna(0).apply(lambda s: s.astype(int)))

In [None]:
# let's plot the proportion of each language among the 30 most common words

# colors are passed as a list so each language keeps a fixed color (a set has no guaranteed order)
(word_count.assign(p_python = word_count.python/word_count['all'],
                   p_java = word_count.java/word_count['all'],
                   p_javascript = word_count.javascript/word_count['all'],
                   p_swift = word_count.swift/word_count['all'])
 .sort_values(by = 'all')[['p_python', 'p_java', 'p_javascript', 'p_swift']]
 .tail(30)
 .sort_values('p_python')
 .plot.barh(width=.75, stacked = True, color=['darkseagreen','lightsteelblue','teal','thistle'])
 .legend(bbox_to_anchor=(1.05, 1.0), loc='upper left'))
    
plt.title('Proportion of 30 Most Common Words by Language Within Our Sample', size=17)
plt.yticks(size=13)
plt.xticks(size=15)
plt.savefig('word_prob.png')
plt.show()

### Plotting Single Word Clouds by Language

In [None]:
# creating word cloud for all of the different programming languages
python_cloud = WordCloud(background_color = 'white', height = 1000, width = 1000).generate(python_words)
java_cloud = WordCloud(background_color = 'white', height = 1000, width = 1000).generate(java_words)
javascript_cloud = WordCloud(background_color = 'white', height = 1000, width = 1000).generate(javascript_words)
swift_cloud = WordCloud(background_color = 'white', height = 1000, width = 1000).generate(swift_words)

In [None]:
# plotting the word cloud
fig, axes = plt.subplots(2,2, figsize = (13,13))
axes[0,0].imshow(python_cloud)
axes[0,0].set_title('Python Cloud', size=20)
axes[0,1].imshow(java_cloud)
axes[0,1].set_title('Java Cloud', size=20)
axes[1,0].imshow(javascript_cloud)
axes[1,0].set_title('JavaScript Cloud', size=20)
axes[1,1].imshow(swift_cloud)
axes[1,1].set_title('Swift Cloud', size=20)

plt.savefig('word_clouds.png')

### Plotting Word Cloud of Complete Train

In [None]:
all_cloud = WordCloud(background_color = 'white', height = 1000, width = 1000).generate(all_words)
plt.imshow(all_cloud)

### Plotting Bi-gram Word Clouds by Language

In [None]:
python_bigrams = pd.Series(list(nltk.ngrams(python_words.split(), 2))).value_counts().head(25)
java_bigrams = pd.Series(list(nltk.ngrams(java_words.split(), 2))).value_counts().head(25)
javascript_bigrams = pd.Series(list(nltk.ngrams(javascript_words.split(), 2))).value_counts().head(25)
swift_bigrams = pd.Series(list(nltk.ngrams(swift_words.split(), 2))).value_counts().head(25)

In [None]:
java_bigrams.sort_values().plot.barh(color='lightsteelblue', width=.9, figsize=(10, 6))
    
plt.title('25 Most Frequently Occurring Java Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

In [None]:
swift_bigrams.sort_values().plot.barh(color='thistle', width=.9, figsize=(10, 6))
    
plt.title('25 Most Frequently Occurring Swift Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

In [None]:
python_bigrams.sort_values().plot.barh(color='darkseagreen', width=.9, figsize=(10, 6))
    
plt.title('25 Most Frequently Occurring Python Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

In [None]:
javascript_bigrams.sort_values().plot.barh(color='teal', width=.9, figsize=(10, 6))
    
plt.title('25 Most Frequently Occurring JavaScript Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

In [None]:
python_data = {k[0] + ' ' + k[1]: v for k, v in python_bigrams.to_dict().items()}
java_data = {k[0] + ' ' + k[1]: v for k, v in java_bigrams.to_dict().items()}
javascript_data = {k[0] + ' ' + k[1]: v for k, v in javascript_bigrams.to_dict().items()}
swift_data = {k[0] + ' ' + k[1]: v for k, v in swift_bigrams.to_dict().items()}

In [None]:
# creating the bigram cloud
cloud_python = WordCloud(background_color = 'white', height = 1000, width = 1000)\
.generate_from_frequencies(python_data)

cloud_java = WordCloud(background_color = 'white', height = 1000, width = 1000).\
generate_from_frequencies(java_data)

cloud_javascript = WordCloud(background_color = 'white', height = 1000, width = 1000).\
generate_from_frequencies(javascript_data)

cloud_swift = WordCloud(background_color = 'white', height = 1000, width = 1000).\
generate_from_frequencies(swift_data)

In [None]:
# plotting the bigram word cloud
fig, axes = plt.subplots(2,2, figsize = (13,13))
axes[0,0].imshow(cloud_python)
axes[0,0].set_title('Python Cloud Bigrams')
axes[0,1].imshow(cloud_java)
axes[0,1].set_title('Java Cloud Bigrams')
axes[1,0].imshow(cloud_javascript)
axes[1,0].set_title('JavaScript Cloud Bigrams')
axes[1,1].imshow(cloud_swift)
axes[1,1].set_title('Swift Cloud Bigrams')

### Plotting Frequency of Bigrams by Language

### Do any features correlate with Word_Length?
- excluding the word_length vs. char_length pair, since both are derived from the same README text
- what if we control for language?

In [None]:
sns.set_theme(style="white")

# Compute the correlation matrix on the numeric columns only (language is a string column)
corr = train[['word_length','watchers','stars','forks']].corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.title('Correlation Heatmap for All Observations', size = 20)

### Heatmap by Language

In [None]:
sns.set_theme(style="white")

y = 1

for x in ['python','java','javascript','swift']:
    
    plt.subplot(2,2,y)
    
    # Compute the correlation matrix for this language, using only the numeric columns
    corr = train[train.language == x][['word_length','watchers','stars','forks']].corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1,center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

    plt.title(f'Correlation Heatmap for {x}', size = 20)
    
    y+=1
    
plt.tight_layout()

plt.savefig('language_corr.png')

### Exploring Document Length by Language

In [None]:
sns.set_theme(style="ticks")

# Initialize the figure with a logarithmic x axis
f, ax = plt.subplots(figsize=(7, 6))
ax.set_xscale("log")

# Plot the README word length with horizontal boxes
sns.boxplot(x="word_length", y="language", data=train,
            whis=[0, 100], width=.6, palette="vlag")

# Add in points to show each observation
sns.stripplot(x="word_length", y="language", data=train,
              size=4, color=".3", linewidth=0)

# Tweak the visual presentation
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

plt.title('Distribution of Word Length by Language', size=20)

### Do Stars, Watchers, and Forks Significantly Vary by Language?

In [None]:
sns.set_theme(style="ticks")

plt.figure(figsize=(30,15))

# Show the joint distribution using kernel density estimation
g = sns.jointplot(
    data=train,
    x="stars", y="word_length", hue="language",
    kind="kde",
)


In [None]:
sns.set_theme(style="ticks")

plt.figure(figsize=(30,15))

# Show the joint distribution using kernel density estimation
g = sns.jointplot(
    data=train,
    x="watchers", y="word_length", hue="language",
    kind="kde",
)

In [None]:
sns.set_theme(style="ticks")

plt.figure(figsize=(30,15))

# Show the joint distribution using kernel density estimation
g = sns.jointplot(
    data=train,
    x="forks", y="word_length", hue="language",
    kind="kde",
)

### Takeaways
Conclusions:  
Next Steps:

##### [Back to Top](#top)

# Modeling <a class="anchor" id="fourth-bullet"></a>

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split

import nltk  
import random  
import string
import bs4 as bs  
import urllib.request  
import re  

In [69]:
# nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bethanythompson/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Bag of Words
https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/

In [65]:
corpus = df.tokenized

In [70]:
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq.keys():
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

In [75]:
wordfreq

{'youtubedl': 119,
 'download': 136,
 'videos': 73,
 'from': 542,
 'youtubecom': 2,
 'or': 747,
 'other': 130,
 'video': 189,
 'platformsinstallationdescriptionoptionsconfigurationoutput': 1,
 'templateformat': 1,
 'selectionvideo': 1,
 'selectionfaqdeveloper': 1,
 'instructionsembedding': 1,
 'youtubedlbugscopyrightinstallationto': 1,
 'install': 192,
 'it': 662,
 'right': 32,
 'away': 5,
 'for': 1393,
 'all': 319,
 'unix': 5,
 'users': 48,
 'linux': 21,
 'macos': 48,
 'etc': 26,
 'typesudo': 1,
 'curl': 4,
 'l': 3,
 'httpsytdlorgdownloadslatestyoutubedl': 3,
 'o': 37,
 'usrlocalbinyoutubedlsudo': 3,
 'chmod': 3,
 'arx': 3,
 'usrlocalbinyoutubedlif': 1,
 'you': 1376,
 'do': 220,
 'not': 458,
 'have': 236,
 'can': 785,
 'alternatively': 4,
 'use': 602,
 'a': 2206,
 'recent': 9,
 'wgetsudo': 1,
 'wget': 3,
 'usrlocalbinyoutubedlwindows': 1,
 'an': 479,
 'exe': 6,
 'file': 223,
 'and': 2261,
 'place': 20,
 'in': 1677,
 'any': 220,
 'location': 16,
 'on': 737,
 'their': 80,
 'path': 53,
 

In [76]:
import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)

In [77]:
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

In [78]:
sentence_vectors = np.asarray(sentence_vectors)

In [80]:
len(sentence_vectors)

100

In [81]:
# use the bag-of-words vectors built above as the feature matrix
X = sentence_vectors
y = df.language

In [82]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.2)

In [83]:
train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

In [84]:
lm = LogisticRegression().fit(X_train, y_train)

In [85]:
train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)

In [86]:
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

Accuracy: 100.00%
---
Confusion Matrix
actual      java  javascript  python  swift
predicted                                  
java          20           0       0      0
javascript     0          20       0      0
python         0           0      20      0
swift          0           0       0     20
---
              precision    recall  f1-score   support

        java       1.00      1.00      1.00        20
  javascript       1.00      1.00      1.00        20
      python       1.00      1.00      1.00        20
       swift       1.00      1.00      1.00        20

    accuracy                           1.00        80
   macro avg       1.00      1.00      1.00        80
weighted avg       1.00      1.00      1.00        80



## Evaluating on Validate (X_test)

In [87]:
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))

Accuracy: 85.00%
---
Confusion Matrix
actual      java  javascript  python  swift
predicted                                  
java           5           1       1      0
javascript     0           4       0      1
python         0           0       4      0
swift          0           0       0      4
---
              precision    recall  f1-score   support

        java       0.71      1.00      0.83         5
  javascript       0.80      0.80      0.80         5
      python       1.00      0.80      0.89         5
       swift       1.00      0.80      0.89         5

    accuracy                           0.85        20
   macro avg       0.88      0.85      0.85        20
weighted avg       0.88      0.85      0.85        20



## Feature Extraction: TF-IDF
- TF: Term Frequency; how often a word appears in a document.
- IDF: Inverse Document Frequency; a measure based on how many documents a word appears in. Words that appear in fewer documents get a higher IDF.
- TF-IDF: A combination of the two measures above, typically their product.
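
The definitions above can be worked through by hand on a tiny, made-up corpus. This sketch uses the plain textbook formulas, TF as count over document length and IDF as the log of total documents over documents containing the word. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up mini corpus: one short "README" per repo.
docs = {
    "python_repo": "install pip install package python",
    "java_repo": "install maven build java",
    "swift_repo": "swift package build ios",
}
tokenized = {name: text.split() for name, text in docs.items()}


def tf(word: str, tokens: list) -> float:
    """Share of the document's tokens that are this word."""
    return Counter(tokens)[word] / len(tokens)


def idf(word: str) -> float:
    """Log of (number of documents / documents containing the word).

    Assumes the word appears in at least one document.
    """
    n_containing = sum(word in tokens for tokens in tokenized.values())
    return math.log(len(tokenized) / n_containing)


def tf_idf(word: str, name: str) -> float:
    return tf(word, tokenized[name]) * idf(word)


for word in ["install", "pip"]:
    print(word, round(tf_idf(word, "python_repo"), 3))
```

Here `install` is more frequent in the Python README, but `pip` scores higher because it appears in only one document.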


## TF-IDF Modeling
- create term frequency on whole df
- split into train and test for X and y
- predict on train
- predict on test
- evaluate
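
One caveat with the steps above: fitting the vectorizer on the whole df lets the held-out READMEs contribute to the vocabulary and IDF weights. A possible alternative, sketched here on made-up documents, is to split first and fit the vectorizer on train only. The corpus, labels, and `random_state` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up documents, repeated to give each language several examples.
docs = [
    "pip install python package", "python virtualenv pip script",
    "maven build java jar", "java gradle spring build",
    "npm install node javascript", "javascript react npm bundle",
    "swift xcode ios cocoapods", "ios swift carthage xcode",
] * 3
labels = ["python", "python", "java", "java",
          "javascript", "javascript", "swift", "swift"] * 3

# Split the raw text first, before any vectorizing.
docs_train, docs_val, y_tr, y_val = train_test_split(
    docs, labels, stratify=labels, test_size=0.25, random_state=123
)

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(docs_train)  # vocabulary and IDF learned from train only
X_val = vectorizer.transform(docs_val)       # held-out text reuses the fitted vocabulary

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print('Validate accuracy: {:.2%}'.format(model.score(X_val, y_val)))
```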

In [48]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df.filtered)
y = df.language

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.2)

In [56]:
train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

In [57]:
lm = LogisticRegression().fit(X_train, y_train)

In [58]:
train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)

In [62]:
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

Accuracy: 100.00%
---
Confusion Matrix
actual      java  javascript  python  swift
predicted                                  
java          20           0       0      0
javascript     0          20       0      0
python         0           0      20      0
swift          0           0       0     20
---
              precision    recall  f1-score   support

        java       1.00      1.00      1.00        20
  javascript       1.00      1.00      1.00        20
      python       1.00      1.00      1.00        20
       swift       1.00      1.00      1.00        20

    accuracy                           1.00        80
   macro avg       1.00      1.00      1.00        80
weighted avg       1.00      1.00      1.00        80



## Evaluating on Validate (X_test)

In [63]:
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))

Accuracy: 75.00%
---
Confusion Matrix
actual      java  javascript  python  swift
predicted                                  
java           2           1       0      0
javascript     1           3       0      0
python         1           1       5      0
swift          1           0       0      5
---
              precision    recall  f1-score   support

        java       0.67      0.40      0.50         5
  javascript       0.75      0.60      0.67         5
      python       0.71      1.00      0.83         5
       swift       0.83      1.00      0.91         5

    accuracy                           0.75        20
   macro avg       0.74      0.75      0.73        20
weighted avg       0.74      0.75      0.73        20



## Term Frequency (TF)
Term frequency can be calculated in a number of ways, all of which reflect how frequently a word appears in a document.

- Raw Count: This is simply the count of the number of occurrences of each word.
- Frequency: The number of times each word appears divided by the total number of words.
- Augmented Frequency: The frequency of each word divided by the maximum frequency. This can help prevent bias towards larger documents.  
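
The three variants can be compared on a single made-up document. This is a small sketch that follows the definitions in the list above and uses only the standard library.

```python
from collections import Counter

# Made-up README text, already cleaned and split into tokens.
tokens = "install pip install package python install".split()

# Raw count: number of occurrences of each word.
raw_count = Counter(tokens)

# Frequency: count divided by the total number of words in the document.
total = len(tokens)
frequency = {word: count / total for word, count in raw_count.items()}

# Augmented frequency: each frequency divided by the largest frequency in the document.
max_frequency = max(frequency.values())
augmented = {word: freq / max_frequency for word, freq in frequency.items()}

for word in raw_count:
    print(word, raw_count[word], round(frequency[word], 3), round(augmented[word], 3))
```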

Would another way increase performance on validate?

### Takeaways
Conclusions:  
Next Steps:

##### [Back to Top](#top)

# Final Conclusions <a class="anchor" id="fifth-bullet"></a>

##### [Back to Top](#top)