## Description

This data was acquired by scraping GitHub search results for the word "repository". After collecting each repo's name, primary coding language, and readme contents, I processed the readme text into a form better suited to natural language processing methods. The goal is to predict a repo's primary coding language from the contents of its readme.
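The `prepare` module is not shown in this notebook, so the following is only a minimal sketch of the kind of cleaning described above. The tiny `STOPWORDS` set and the function names are assumptions for illustration (the real module appears to rely on NLTK's stopwords and a lemmatizer, which this sketch skips).

```python
import re
import unicodedata

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"a", "an", "the", "and", "is", "to", "of", "in", "for", "this"}


def basic_clean(text: str) -> str:
    """Lowercase, normalize to ASCII, and keep only letters, digits, and single spaces."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text: str) -> str:
    """Drop any word found in the stopword set."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)


sample = "# My Project\nThis is a **Python** tool for parsing logs."
cleaned = remove_stopwords(basic_clean(sample))
print(cleaned)
```

One side effect worth knowing: stripping punctuation this way turns a mention of "C++" into a bare "c", so any language detection should probably run before this step.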

## Findings

- Python, Java, and C++ were the most common coding languages
- Only about a third of the repositories had specific coding languages mentioned in their readmes
- The common words found in each coding language group varied from group to group
- The best performing model (KNN) beat my baseline by 19% 

## Packages

In [1]:
from requests import get
import numpy as np
from bs4 import BeautifulSoup
import bs4
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import requests
import os
import json
from typing import Dict, List, Optional, Union, cast
import prepare
import acquire

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


from env import github_token, github_username

headers = {"Authorization": f"token {github_token}", "User-Agent": github_username}



## Acquire

In [2]:
''' Acquires the data via a saved csv, or if that is not present runs the scrape_github_data function '''

df = acquire.get_repo_data(cached = True)
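Since `acquire.get_repo_data` itself is not shown here, this is a rough sketch of the cache-or-scrape pattern that the docstring describes. `scrape_stub`, `get_repo_data_sketch`, and the default file name are hypothetical stand-ins, not the real implementation.

```python
import os

import pandas as pd


def scrape_stub() -> pd.DataFrame:
    """Hypothetical stand-in for the real scraping function."""
    return pd.DataFrame(
        {"repo": ["owner/name"], "language": ["Python"], "readme_contents": ["hello"]}
    )


def get_repo_data_sketch(path: str = "repos.csv", cached: bool = True) -> pd.DataFrame:
    """Read the saved csv when allowed and present; otherwise scrape and save."""
    if cached and os.path.exists(path):
        return pd.read_csv(path)
    df = scrape_stub()
    df.to_csv(path, index=False)
    return df
```

Passing `cached=False` would force a fresh scrape and overwrite the saved file, which is useful when the search results may have changed.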

## Prepare

In [3]:
''' 
Takes in readme contents and applies the make_prepped_columns() function. Drops altered readme 
columns with the exception of the lemmatized one (as that's what I'll be working with) and the original.
Also gets rid of rows containing null readme or null language columns. Drops rows if their respective language 
appears fewer than twice (a stratified split needs at least two rows per language). Removes stopwords.
'''

df = prepare.prep_repos(df)

'''
Adds a feature that searches the readme for mentions of a specific coding language and extracts it, then puts that 
language into the "languages_in_readme" column. Dummies are then created for each language found. Also adds a 
feature for "readme_length"
'''

df = prepare.add_language_dummies_and_length_feature(df)
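As an illustration of the feature step described above, here is a small sketch of how a language mention could be extracted and turned into dummy columns. The `LANGUAGES` list, the `find_language` helper, and the `mentions_` prefix are assumptions; the real `prepare` function may work differently.

```python
import re

import pandas as pd

# Hypothetical list of languages to look for
LANGUAGES = ["python", "java", "c++"]


def find_language(text):
    """Return the first listed language mentioned as a whole word, else None."""
    for lang in LANGUAGES:
        # Lookarounds instead of \b so that names ending in symbols (c++) still match
        if re.search(rf"(?<!\w){re.escape(lang)}(?!\w)", text.lower()):
            return lang
    return None


demo = pd.DataFrame({"readme_contents_clean": [
    "a python tool for parsing logs",
    "game engine written in c++",
    "notes and documentation",
]})

demo["languages_in_readme"] = demo.readme_contents_clean.apply(find_language)
demo["readme_length"] = demo.readme_contents_clean.str.len()

# One indicator column per language found; rows with no mention get all zeros
dummies = pd.get_dummies(demo.languages_in_readme, prefix="mentions")
demo = pd.concat([demo, dummies], axis=1)
print(demo)
```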

In [None]:
df.head()

## Split

In [None]:
# Splits the data into train, validate, and test

train, validate, test = prepare.split(df, stratify_by = 'language')
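`prepare.split` is not shown, so this is one plausible way to get a stratified three-way split with scikit-learn. The function name `split_sketch`, the split proportions, and the seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df, stratify_by, seed=123):
    """Two-step stratified split: hold out test first, then carve validate out of the rest."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[stratify_by]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed,
        stratify=train_validate[stratify_by],
    )
    return train, validate, test


demo = pd.DataFrame({"language": ["Python"] * 10 + ["Java"] * 10, "x": range(20)})
demo_train, demo_validate, demo_test = split_sketch(demo, "language")
print(len(demo_train), len(demo_validate), len(demo_test))
```

Stratifying keeps the language proportions roughly equal across the three sets, which matters here because some languages appear only a handful of times.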

## Explore

In [None]:
# Seeing how often each language appears in the dataset

df.language.value_counts()

In [None]:
# Now I visualize the language distributions

plt.subplots(figsize = (20, 6))
plt.title('Coding Language Distribution')
sns.countplot(x="language", data=df,
                 palette="Blues_d")

In [None]:
# Finding the top words from all of the readmes combined

the_words = ' '.join(df.readme_contents_clean)
all_freq = pd.Series(the_words.split(' ')).value_counts()

# rename_axis + reset_index builds both columns no matter how pandas names the value_counts() result
all_word_freq = all_freq.rename_axis('Top Words').reset_index(name='Count')

plt.subplots(figsize = (20, 6))
plt.title('All Words Distribution')
sns.barplot(x="Top Words", y='Count', data=all_word_freq.head(10),
                 palette="Purples_d")
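One thing to watch with the counting approach above: `split(' ')` produces empty-string tokens wherever two spaces sit next to each other, and those can end up in the counts. A small alternative sketch using `collections.Counter` and a plain `split()` avoids that. The sample readmes are made up.

```python
from collections import Counter

# Made-up cleaned readmes; note the double space in the second one
readmes = ["install python module", "python  project setup", "project module test"]

print("python  project".split(" "))  # the double space yields an empty token
print("python  project".split())     # whitespace split drops it

counts = Counter(word for doc in readmes for word in doc.split())
print(counts.most_common(3))
```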

In [None]:
# Viewing the top words of the three most frequent coding languages in the dataset

python_words = ' '.join(train[train.language == 'Python'].readme_contents_clean)
java_words = ' '.join(train[train.language == 'Java'].readme_contents_clean)
cplusplus_words = ' '.join(train[train.language == 'C++'].readme_contents_clean)

python_freq = pd.Series(python_words.split(' ')).value_counts()
java_freq = pd.Series(java_words.split(' ')).value_counts()
cplusplus_freq = pd.Series(cplusplus_words.split(' ')).value_counts()

print('Popular Words in Python')
print(python_freq.head())
print('\nPopular Words in Java')
print(java_freq.head())
print('\nPopular Words in C++')
print(cplusplus_freq.head())

In [None]:
# Showing a distribution of the top words used in readmes where the repo's main language was Python

# rename_axis + reset_index builds both columns no matter how pandas names the value_counts() result
py_word_freq = python_freq.rename_axis('Top Words').reset_index(name='Count')

plt.subplots(figsize = (20, 6))
plt.title('Python Word Distribution')
sns.barplot(x="Top Words", y='Count', data=py_word_freq.head(10),
                 palette="Reds_d")

In [None]:
# Showing a distribution of the top words used in readmes where the repo's main language was Java

# rename_axis + reset_index builds both columns no matter how pandas names the value_counts() result
java_word_freq = java_freq.rename_axis('Top Words').reset_index(name='Count')

plt.subplots(figsize = (20, 6))
plt.title('Java Word Distribution')
sns.barplot(x="Top Words", y='Count', data=java_word_freq.head(10),
                 palette="Oranges_d")

In [None]:
# Showing a distribution of the top words used in readmes where the repo's main language was C++

# rename_axis + reset_index builds both columns no matter how pandas names the value_counts() result
cpl_word_freq = cplusplus_freq.rename_axis('Top Words').reset_index(name='Count')

plt.subplots(figsize = (20, 6))
plt.title('C++ Word Distribution')
sns.barplot(x="Top Words", y='Count', data=cpl_word_freq.head(10),
                 palette="Greens_d")

In [None]:
# Surprisingly, the most common words differed substantially from language to language.
# Across the top 10 words of the three most common languages, only two words (project and module)
# appeared in more than one language.
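The overlap check described above was done by eye, but it could be sketched in code along these lines. The `shared_top_words` helper is hypothetical, and the example word lists are placeholders rather than the notebook's actual top words.

```python
from itertools import combinations


def shared_top_words(top_words):
    """Map each pair of languages to the set of top words they have in common."""
    shared = {}
    for (lang_a, words_a), (lang_b, words_b) in combinations(top_words.items(), 2):
        overlap = set(words_a) & set(words_b)
        if overlap:
            shared[(lang_a, lang_b)] = overlap
    return shared


# Placeholder word lists, NOT the results from this dataset
example = {
    "Python": ["project", "data", "install"],
    "Java": ["project", "build", "module"],
    "C++": ["module", "library", "compile"],
}
print(shared_top_words(example))
```

In the notebook, the lists could come from something like `python_freq.head(10).index`, and likewise for the other two languages.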

In [None]:
# Viewing the average readme length grouped by coding language. There is a large amount of variance in the
# readme lengths

language_lengths = train.groupby('language').readme_length.mean().sort_values(ascending = False)
language_lengths

In [None]:
# Now I'm gonna visualize it

lang_length_df = pd.DataFrame(language_lengths)
lang_length_df = lang_length_df.reset_index()
lang_length_df = lang_length_df.rename(columns = {'language': 'Programming Language',
                                                  'readme_length': 'Readme Length (characters)'})
plt.subplots(figsize = (15, 7.5))
plt.title('Mean Readme Length By Language')
sns.barplot(x='Programming Language', y='Readme Length (characters)', data=lang_length_df,
                 palette="Greens_d")



In [None]:
# A sample of repos that had languages listed in the readme

df[df['languages_in_readme'].notnull()].head(8)

## Prep for Modeling

In [None]:
# Drops the columns not used in modeling, then re-splits the data into train, validate, and test
df.drop(columns = ['languages_in_readme', 'repo'], inplace = True)
train, validate, test = prepare.split(df, stratify_by = 'language')

In [None]:
# Splitting from target variable for creating models

X_train = train.drop(columns = ['language'])
X_validate = validate.drop(columns = ['language'])
X_test = test.drop(columns = ['language'])

In [None]:
# Creating target variable groups for creating models

y_train = train.language
y_validate = validate.language
y_test = test.language

In [None]:
# Creating a vectorizer object 

tfidf = TfidfVectorizer()

# Fitting that object onto the train data

tfidf.fit(X_train.readme_contents_clean)

# Applying the vector transformer to each data set

X_train_vectorized = tfidf.transform(X_train.readme_contents_clean)
X_validate_vectorized = tfidf.transform(X_validate.readme_contents_clean)
X_test_vectorized = tfidf.transform(X_test.readme_contents_clean)
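To show why the vectorizer is fit on train only, here is a tiny sketch with made-up documents. Words that never appeared in the fitted data (here, "kotlin") are simply ignored at transform time, so validate and test cannot leak vocabulary into the model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
train_docs = ["python data project", "java build project"]
new_docs = ["python kotlin project"]

demo_tfidf = TfidfVectorizer()
demo_tfidf.fit(train_docs)

new_vectorized = demo_tfidf.transform(new_docs)

print(sorted(demo_tfidf.vocabulary_))  # vocabulary learned from train_docs only
print(new_vectorized.shape)            # one row, one column per learned word
```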

In [None]:
# Creating a dataframe that will hold predicted and actual values for evaluation metrics

train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))
test = pd.DataFrame(dict(actual=y_test))

## Modeling

In [None]:
# Establishing a baseline

print('Baseline Accuracy:', round((21/len(df)), 3))
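A baseline for classification is usually the accuracy of always predicting the most common class. The cell above hard-codes a count; a sketch that derives the same kind of number from the labels might look like this. The label counts below are made up and are not this dataset's.

```python
import pandas as pd

# Made-up labels for illustration
labels = pd.Series(["Python"] * 5 + ["Java"] * 3 + ["C++"] * 2)

# The baseline always predicts the single most frequent language
most_common = labels.mode()[0]
baseline_accuracy = (labels == most_common).mean()

print("Baseline prediction:", most_common)
print("Baseline accuracy:", round(baseline_accuracy, 3))
```

Computing it this way keeps the baseline correct if the data is re-scraped and the counts change.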

## Logistic Regression

In [None]:
# Creating and fitting the logistic regression model

lm = LogisticRegression()
lm.fit(X_train_vectorized, y_train)

In [None]:
# Applying and evaluating the logistic regression model

train['predicted_logreg'] = lm.predict(X_train_vectorized)
validate["predicted_logreg"] = lm.predict(X_validate_vectorized)
print('Train:', (train.actual == train.predicted_logreg).mean())
print('Validate:', (validate.actual == validate.predicted_logreg).mean())

## Gaussian Naive Bayes

In [None]:
# Creating and fitting the naive bayes model

gnb = GaussianNB()
gnb.fit(X_train_vectorized.toarray(), y_train)
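`GaussianNB` needs a dense array, which is why `.toarray()` is called above. As a possible alternative (not what this notebook uses), `MultinomialNB` accepts the sparse TF-IDF matrix directly and is a common choice for text features. A minimal sketch with made-up documents:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents and labels for illustration
docs = ["python pandas notebook", "python script data",
        "java maven build", "java spring build"]
doc_labels = ["Python", "Python", "Java", "Java"]

vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(docs)

# No .toarray() needed: MultinomialNB works on sparse input
mnb = MultinomialNB()
mnb.fit(X_sparse, doc_labels)

prediction = mnb.predict(vectorizer.transform(["maven build tool"]))
print(prediction)
```

Skipping the dense conversion can save a lot of memory once the vocabulary grows into the tens of thousands of words.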

In [None]:
# Applying and evaluating the naive bayes model

train['predicted_gnb'] = gnb.predict(X_train_vectorized.toarray())
validate['predicted_gnb'] = gnb.predict(X_validate_vectorized.toarray())
print('Train:', (train.actual == train.predicted_gnb).mean())
print('Validate:', (validate.actual == validate.predicted_gnb).mean())

## Random Forest

In [None]:
# Creating and fitting the random forest model

rf = RandomForestClassifier(max_depth = 8, min_samples_leaf = 3, random_state=123)
rf.fit(X_train_vectorized, y_train)

In [None]:
# Applying and evaluating the random forest model

# The model was fit on the sparse matrix, so predict on the sparse matrix as well
train['predicted_rf'] = rf.predict(X_train_vectorized)
validate['predicted_rf'] = rf.predict(X_validate_vectorized)
print('Train:', (train.actual == train.predicted_rf).mean())
print('Validate:', (validate.actual == validate.predicted_rf).mean())

## KNN

In [None]:
# Creating and fitting the KNN model

kn = KNeighborsClassifier(n_neighbors=6, weights='uniform')
kn = kn.fit(X_train_vectorized.toarray(), y_train)
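The notebook does not show how `n_neighbors=6` was chosen. One common approach, sketched here on synthetic data, is to try a range of values and keep the one with the best validate accuracy. All data and names below are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the vectorized readmes
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_tr = np.array(["A"] * 30 + ["B"] * 30)
X_val = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
y_val = np.array(["A"] * 10 + ["B"] * 10)

# Score each candidate k on validate, never on test
scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = model.score(X_val, y_val)

best_k = max(scores, key=scores.get)
print("Best k:", best_k, "with validate accuracy", scores[best_k])
```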

In [None]:
# Applying and evaluating the KNN model

train['predicted_knn'] = kn.predict(X_train_vectorized.toarray())
validate['predicted_knn'] = kn.predict(X_validate_vectorized.toarray())
print('Train:', (train.actual == train.predicted_knn).mean())
print('Validate:', (validate.actual == validate.predicted_knn).mean())

## Testing on KNN

In [None]:
# Since KNN is my best performing model on validate, I'm gonna go ahead and test on it

test['predicted_knn'] = kn.predict(X_test_vectorized.toarray())

In [None]:
print('Test Performance:', (test.actual == test.predicted_knn).mean())

A performance more than twice as good as the baseline is a great improvement!
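Overall accuracy hides which languages the model gets right and which it confuses. `classification_report` is imported at the top of this notebook but never used; a sketch of how it could break the result down per language follows. The labels below are made up and are not this model's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up actual and predicted labels for illustration
actual = ["Python", "Python", "Java", "C++", "Java", "Python"]
predicted = ["Python", "Java", "Java", "C++", "Java", "Python"]

acc = accuracy_score(actual, predicted)
report = classification_report(actual, predicted, zero_division=0)

print("Accuracy:", round(acc, 3))
print(report)
```

In the notebook this would be called with `test.actual` and `test.predicted_knn`. Setting `zero_division=0` avoids warnings for languages the model never predicts.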