# Instructions

### Acquire:
- Decide on a list of GitHub repositories to scrape. You can use the links provided in the project guidance, or you can generate a list programmatically using web scraping techniques.
- Use a script to scrape the README files for each repository and store the data in a CSV or JSON file. Make sure to include the programming language of each repository as a label.
- Document where your data comes from, including the date of acquisition.
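
The acquisition steps above could be sketched roughly as follows. This is a minimal illustration, not the acquire script that produced `data.json`: it assumes the public GitHub REST API (`/repos/{owner}/{repo}` for the primary language, `/repos/{owner}/{repo}/readme` for the base64-encoded README) and uses the standard library's `urllib` rather than `requests`. The function names, the `acquired_on` field, and the injectable `fetch` argument are hypothetical.

```python
import base64
import json
import urllib.request
from datetime import date

API_ROOT = "https://api.github.com/repos"


def get_json(url):
    """Fetch a URL and decode its JSON body (unauthenticated, so rate limited)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_repo_record(repo, fetch=get_json):
    """Build one labeled record for an 'owner/name' repository string."""
    info = fetch(f"{API_ROOT}/{repo}")
    readme = fetch(f"{API_ROOT}/{repo}/readme")
    # The readme endpoint returns the file body base64-encoded.
    contents = base64.b64decode(readme["content"]).decode("utf-8", errors="replace")
    return {
        "repo": repo,
        "language": info["language"],
        "readme_contents": contents,
        # Hypothetical field: records the acquisition date next to the data.
        "acquired_on": date.today().isoformat(),
    }


def scrape_repos(repos, path="data.json", fetch=get_json):
    """Scrape every repository in the list and store the records as JSON."""
    records = [fetch_repo_record(repo, fetch) for repo in repos]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return records
```

Passing the fetcher as an argument keeps the parsing logic testable without network access. A real run would also need to handle repositories that have no README or no detected language.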

### Prepare:
- Load the data into a Pandas DataFrame.
- Explore and clean the data as needed. This might include removing duplicates, handling missing data, and removing stop words.
- Split the data into training and testing sets.
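
As a rough illustration of the cleaning step, the sketch below lowercases text, folds it to ASCII, strips punctuation, and removes stop words. It is an assumption about what such a step might look like, not the contents of the `prepare` module imported later. The tiny `STOP_WORDS` set is a stand-in so the example runs without downloading NLTK's stop word corpus.

```python
import re
import unicodedata

# Small illustrative stop word list; a real project would likely use NLTK's.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for"}


def basic_clean(text):
    """Lowercase, fold to ASCII, and keep only letters, digits, apostrophes, whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("utf-8")
    return re.sub(r"[^a-z0-9'\s]", "", text)


def remove_stopwords(text, extra_words=()):
    """Drop stop words, plus any project-specific extras, from whitespace-split text."""
    stop = STOP_WORDS | set(extra_words)
    return " ".join(word for word in text.split() if word not in stop)


print(remove_stopwords(basic_clean("# Node.js is Open-Source!")))
# prints: nodejs opensource
```

Because punctuation is deleted rather than replaced with a space, URLs and HTML attributes collapse into single long tokens. This may be worth revisiting if such tokens dominate the word counts.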

### Explore:
- Explore and visualize the natural language data that you have acquired. For example, you could:
  - Count the most common words in READMEs.
  - Visualize the distribution of README lengths by programming language.
  - Explore the number of unique words used by different programming languages.
  - Identify any words that uniquely identify a programming language.
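
One way to approach the last idea, finding words that appear in only one language's READMEs, is a set difference over per-language vocabularies. The sketch below is a minimal version under that assumption. The function name and the `(language, text)` input format are hypothetical.

```python
from collections import defaultdict


def words_unique_to_each_language(docs):
    """Map each language to the words found only in that language's documents.

    docs: iterable of (language, cleaned_text) pairs.
    """
    # Collect the vocabulary of each language across all of its documents.
    vocab = defaultdict(set)
    for language, text in docs:
        vocab[language].update(text.split())

    unique = {}
    for language, words in vocab.items():
        # Union of every other language's vocabulary.
        others = set().union(*(v for lang, v in vocab.items() if lang != language))
        unique[language] = words - others
    return unique


docs = [
    ("Python", "pip install kivy"),
    ("JavaScript", "npm install node"),
]
print(words_unique_to_each_language(docs)["Python"] == {"pip", "kivy"})
# prints: True
```

With a DataFrame holding `language` and `lemmatized` columns, the pairs could be supplied as `zip(df['language'], df['lemmatized'])`.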

### Model:
- Transform the text of the README files into a form that can be used in a machine learning model. This might involve using techniques like bag-of-words, TF-IDF, or word embeddings.
- Try fitting several different models, such as logistic regression, decision trees, or random forests.
- Evaluate the performance of each model using metrics like accuracy, precision, and recall.
- Build a function that takes in the text of a README file and tries to predict the programming language.
- Consider narrowing down the number of unique values in your target variable if you have many different programming languages. For example, you could focus on the top 3 languages and label everything else as "Other".
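
Narrowing the target variable to the most frequent languages might look like the sketch below. The function name and its defaults are hypothetical. It assumes the labels are in a pandas Series.

```python
import pandas as pd


def collapse_rare_languages(languages, n_top=3, other_label="Other"):
    """Keep the n_top most frequent labels and replace the rest with other_label."""
    # value_counts() sorts by frequency, so the first n_top index entries are the top labels.
    top = languages.value_counts().index[:n_top]
    # where() keeps values where the condition holds and substitutes other_label elsewhere.
    return languages.where(languages.isin(top), other_label)


labels = pd.Series(["JavaScript"] * 4 + ["Python"] * 3 + ["Rust"] * 2 + ["Go", "C"])
print(collapse_rare_languages(labels).value_counts().to_dict())
```

When several languages tie at the cutoff, which one lands in the top group depends on the order `value_counts()` returns, so it may be worth checking ties by hand.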


### Deliverables:
- Create a well-documented Jupyter notebook that contains your analysis.
- Create a README file that describes your project and includes instructions on how to run it.
- Create 2-5 Google slides suitable for a general audience that summarize your findings in exploration and the results of your modeling. Include well-labeled visualizations in your slides.
- Link your Google slide deck in the README of your repository.

In [1]:
import unicodedata
import re
import json
import os
from requests import get
from bs4 import BeautifulSoup

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import numpy as np

import prepare as p
import explore as e

# Acquire

In [2]:
# read JSON output from Acquire script
# create Pandas DataFrame
df = pd.read_json('data.json')

In [3]:
# My repos language breakdowns
df['language'].value_counts()

JavaScript    7
Python        6
Rust          4
TypeScript    3
C             2
Shell         1
Go            1
C#            1
QML           1
Name: language, dtype: int64

# Prepare

In [4]:
# Some repos are topic specific, so this is a list of additional stopwords
# to reduce noise (named extra_stopwords so it does not shadow the nltk stopwords import)
extra_stopwords = ['covid', 'covid-19', "'"]

In [5]:
# languages we are focusing on?
languages = ['Python', 'JavaScript', 'Jupyter Notebook', 'HTML', 'R']

In [6]:
# add new prepared columns to dataframe from JSON
prepared_df = p.prep_repo_data(df, 'readme_contents', extra_words=extra_stopwords)

# Explore

In [7]:
# split data
train, validate, test = e.split_data(prepared_df) 

In [8]:
# language breakdown
language_totals = pd.concat([train.language.value_counts(),
                    train.language.value_counts(normalize=True)], axis=1)
language_totals.columns = ['count', 'percentage']
language_totals

| | count | percentage |
|---|---|---|
| JavaScript | 4 | 0.333333 |
| Python | 2 | 0.166667 |
| TypeScript | 2 | 0.166667 |
| C# | 1 | 0.083333 |
| C | 1 | 0.083333 |
| Go | 1 | 0.083333 |
| Rust | 1 | 0.083333 |


In [9]:
# subset word Frequency by Language

In [10]:
train[train['language'] == 'JavaScript']['clean']

16    p aligncenter br img altlogo srcmedialogopng b...
6     atom build statushttpsdevazurecomgithubatomapi...
14    p aligncenterimg srcstaticlogosmallpng altmark...
4     nodejs nodejs opensource crossplatform javascr...
Name: clean, dtype: object

In [11]:
# tokenize words return value counts

In [12]:
train[train['language'] == 'JavaScript']

| | language | repo | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 16 | JavaScript | GitSquared/edex-ui | `<p align="center">\n <br>\n <img alt="Logo" ...` | p aligncenter br img altlogo srcmedialogopng b... | p aligncent br img altlogo srcmedialogopng brb... | p aligncenter br img altlogo srcmedialogopng b... |
| 6 | JavaScript | atom/atom | `# Atom\n\n[![Build status](https://dev.azure.c...` | atom build statushttpsdevazurecomgithubatomapi... | atom build statushttpsdevazurecomgithubatomapi... | atom build statushttpsdevazurecomgithubatomapi... |
| 14 | JavaScript | marktext/marktext | `<p align="center"><img src="static/logo-small....` | p aligncenterimg srcstaticlogosmallpng altmark... | p aligncenterimg srcstaticlogosmallpng altmark... | p aligncenterimg srcstaticlogosmallpng altmark... |
| 4 | JavaScript | nodejs/node | `# Node.js\n\nNode.js is an open-source, cross-...` | nodejs nodejs opensource crossplatform javascr... | nodej nodej opensourc crossplatform javascript... | nodejs nodejs opensource crossplatform javascr... |


In [13]:
def get_wordcounts_for_language(df, column, language):
    # filter dataframe to rows with specified language
    language_df = df[df['language'] == language]

    # create a new series by splitting the text in the specified column into individual words
    words_series = language_df[column].str.split(expand=True).stack()

    # group the series by word and count the number of occurrences
    words_counts = words_series.groupby(words_series).count()
    
    # sort value counts in descending order
    words_counts = words_counts.sort_values(ascending=False) 
    
    return words_counts

In [36]:
words_counts = get_wordcounts_for_language(train, 'lemmatized', 'Python')
words_counts[:5]

kivy              13
targetblankimg    11
python             7
email              6
license            6
dtype: int64

In [15]:
def get_unique_words_for_language(df, column, language):
    # filter dataframe by rows with the specified language
    language_df = df[df['language'] == language]

    # create a new series by splitting the text in the specified column into individual words
    # using str.split(expand=True) followed by stack()
    words_series = language_df[column].str.split(expand=True).stack()

    # get a series holding each unique word for the specified language
    unique_words_series = pd.Series(words_series.unique())

    return unique_words_series

In [16]:
unique_python_words = get_unique_words_for_language(train,'lemmatized','Python')
unique_python_words

0                                                   kivy
1                                                    img
2                                             alignright
3                                              height256
4      srchttpsrawgithubusercontentcomkivykivymasterk...
                             ...                        
353                      linuxkernelassetslinuxkernelpng
354                                               author
355                                               byncsa
356                                             creative
357        commonshttpcreativecommonsorglicensesbyncsa40
Length: 358, dtype: object

In [17]:
def get_words_for_language(df, column, language):
    # filter dataframe by rows with the specified language
    language_df = df[df['language'] == language]

    # create a new series by splitting the text in the specified column into individual words
    # using str.split(expand=True) followed by stack()
    words_series = language_df[column].str.split(expand=True).stack()

    # reset index to flatten the (row, position) MultiIndex produced by stack()
    words_series = words_series.reset_index(drop=True)
    
    return words_series

In [18]:
python_words = get_words_for_language(train,'lemmatized','Python')
python_words

0                                                   kivy
1                                                    img
2                                             alignright
3                                              height256
4      srchttpsrawgithubusercontentcomkivykivymasterk...
                             ...                        
536                                              license
537                                             licensed
538                                               byncsa
539                                             creative
540        commonshttpcreativecommonsorglicensesbyncsa40
Length: 541, dtype: object

In [19]:
train['language'].value_counts()

JavaScript    4
Python        2
TypeScript    2
C#            1
C             1
Go            1
Rust          1
Name: language, dtype: int64

In [20]:
get_wordcounts_for_language(train,'lemmatized','C')

linux        47
o            20
file         17
supported    16
boot         16
             ..
gnewsense     1
gnu           1
gnulinux      1
gobolinux     1
zstack        1
Length: 625, dtype: int64

In [21]:
train[train['language'] == 'C']['lemmatized']

7    h1 aligncenter hrefhttpswwwventoynetventoya h1...
Name: lemmatized, dtype: object

In [22]:
column_name = 'lemmatized'
grouped_df = train.groupby('language')
for language, group in grouped_df:
    # get the value counts for the specified column
    # (word_counts is overwritten on each pass, so only the last group's counts survive the loop)
    word_counts = group[column_name].str.split(expand=True).stack().value_counts()

In [23]:
word_counts_sorted = word_counts.sort_values(ascending=False)
top_words = word_counts_sorted.iloc[:3]
type(top_words)

pandas.core.series.Series

In [24]:
train.lemmatized

15    logo powershell welcome powershell github comm...
7     h1 aligncenter hrefhttpswwwventoynetventoya h1...
26    kivy img alignright height256 srchttpsrawgithu...
10    httpsassetsvercelcomimageuploadv1549723846repo...
20    linuxinsides bookinprogress linux kernel insid...
11    mkcert mkcert simple tool making locallytruste...
16    p aligncenter br img altlogo srcmedialogopng b...
6     atom build statushttpsdevazurecomgithubatomapi...
5     p aligncenter hrefhttpsgithubcomtrimstraythebo...
23    p aligncenter img width180 srcpubliclogopng al...
18    nativefier example nativefier app macos dockgi...
14    p aligncenterimg srcstaticlogosmallpng altmark...
4     nodejs nodejs opensource crossplatform javascr...
Name: lemmatized, dtype: object

In [29]:
def get_top_words_by_language(df, column_name, n_top_words):
    # group the dataframe by language
    grouped_df = df.groupby('language')
    
    # initialize an empty dictionary to store the results
    results = {}
    
    # loop over each group
    for language, group in grouped_df:
        # get the value counts for the specified column
        word_counts = group[column_name].str.split(expand=True).stack().value_counts()
        # sort the value counts in descending order
        word_counts_sorted = word_counts.sort_values(ascending=False)
        # get the top n words
        top_words = word_counts_sorted.iloc[:n_top_words]
        
        # add the language and top words to the results dictionary
        results[language] = top_words
        
    # get the value counts for the specified column for the ungrouped dataframe
    total_word_counts = df[column_name].str.split(expand=True).stack().value_counts()
    # sort the value counts in descending order
    total_word_counts_sorted = total_word_counts.sort_values(ascending=False)
    # get the top n words
    top_total_words = total_word_counts_sorted.iloc[:n_top_words]
    
    # add the total top words to the results dictionary
    results['total'] = top_total_words
    
    # create a dataframe from the results dictionary
    df_results = pd.DataFrame(results)
    
    # replace NaN values with 0 (a 0 can mean the word fell outside that language's top n, not that it never occurs)
    df_results.fillna(0, inplace=True)
    
    # sort the columns by the total values
    df_results = df_results.sort_values(by='total',ascending=False)[:n_top_words]
    
    return df_results
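
In `get_top_words_by_language` above, each language contributes only its own top n words, so a 0 in the result can mean "outside that language's top n" rather than "never used". A variant that keeps each language's full counts for the overall top words might look like the sketch below. The function name is hypothetical, and it assumes a `language` column as in the cells above.

```python
import pandas as pd


def word_counts_by_language(df, column, n_top_words):
    """Count every word per language, then keep the n_top_words most frequent overall."""
    # One row per (language, word) occurrence.
    words = df[["language"]].assign(word=df[column].str.split()).explode("word")
    # Rows are words, columns are languages, values are occurrence counts.
    counts = pd.crosstab(words["word"], words["language"])
    counts["total"] = counts.sum(axis=1)
    return counts.sort_values("total", ascending=False).head(n_top_words)
```

Calling `word_counts_by_language(train, 'lemmatized', 10)` would then report the true per-language count of each overall top word.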

In [34]:
lemmatized_language_value_counts = get_top_words_by_language(train, 'lemmatized',10)

In [35]:
lemmatized_language_value_counts

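| | C | C# | Go | JavaScript | Python | Rust | TypeScript | total |
|---|---|---|---|---|---|---|---|---|
| nbspnbspsmallorangediamond | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 908.0 |
| bash | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 317.0 |
| p | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 273.0 |
| tool | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 175.0 |
| security | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 166.0 |
| hehim | 0.0 | 0.0 | 0.0 | 137.0 | 0.0 | 0.0 | 0.0 | 137.0 |
| file | 17.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 124.0 |
| linux | 47.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 120.0 |
| web | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 97.0 |
| blacksmallsquare | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 97.0 |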


In [38]:
train.head()

| | language | repo | original | clean | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 15 | C# | PowerShell/PowerShell | `# ![logo][] PowerShell\n\nWelcome to the Power...` | logo powershell welcome powershell github comm... | logo powershel welcom powershel github commun ... | logo powershell welcome powershell github comm... |
| 7 | C | ventoy/Ventoy | `<h1 align="center">\n <a href=https://www.ven...` | h1 aligncenter hrefhttpswwwventoynetventoya h1... | h1 aligncent hrefhttpswwwventoynetventoya h1 p... | h1 aligncenter hrefhttpswwwventoynetventoya h1... |
| 26 | Python | kivy/kivy | `Kivy\n====\n\n<img align="right" height="256" ...` | kivy img alignright height256 srchttpsrawgithu... | kivi img alignright height256 srchttpsrawgithu... | kivy img alignright height256 srchttpsrawgithu... |
| 10 | TypeScript | vercel/hyper | `![](https://assets.vercel.com/image/upload/v15...` | httpsassetsvercelcomimageuploadv1549723846repo... | httpsassetsvercelcomimageuploadv1549723846repo... | httpsassetsvercelcomimageuploadv1549723846repo... |
| 20 | Python | 0xAX/linux-insides | `linux-insides\n===============\n\nA book-in-pr...` | linuxinsides bookinprogress linux kernel insid... | linuxinsid bookinprogress linux kernel insid g... | linuxinsides bookinprogress linux kernel insid... |


In [48]:
def language_breakdown(df,column):
    # Group by "language" column and sum the word count for each group
    word_count_by_language = df.groupby("language")[column].apply(lambda x: x.str.split().str.len().sum()).reset_index(name="word_count")

    # Calculate the total word count for all languages
    total_word_count = word_count_by_language["word_count"].sum()

    # Calculate the proportion of total words in each language
    word_count_by_language["percentage"] = word_count_by_language["word_count"] / total_word_count

    # format percentage column
    word_count_by_language["percentage"] = word_count_by_language["percentage"].map("{:.2%}".format)
    
    # sort and fix index
    word_count_by_language = word_count_by_language.sort_values(by='word_count', ascending=False).reset_index(drop=True)
    
    return word_count_by_language

language_breakdown(train,'lemmatized')

| | language | word_count | percentage |
|---|---|---|---|
| 0 | JavaScript | 3517 | 48.12% |
| 1 | C | 938 | 12.83% |
| 2 | C# | 725 | 9.92% |
| 3 | TypeScript | 630 | 8.62% |
| 4 | Go | 545 | 7.46% |
| 5 | Python | 541 | 7.40% |
| 6 | Rust | 413 | 5.65% |


# Next Steps

1. After preprocessing, create a TF-IDF matrix from the text data using scikit-learn's TfidfVectorizer class. This transforms the text into a sparse matrix in which each row represents a document and each column represents a unique word in the corpus. Each value is the TF-IDF weight of that word in that document.

2. Use this TF-IDF matrix as input to a machine learning model. Split the dataset into training and testing sets, fitting the vectorizer on the training set only so that no test vocabulary leaks into the features. Then train a classifier, such as logistic regression or a random forest, on the training set with its fit() method.

3. After training the model, evaluate its performance on the testing set. Use the classifier's predict() method to predict the programming language of each test document, then compare the predictions with the actual labels to calculate metrics such as accuracy, precision, recall, and F1-score.

4. Use the trained model to predict the programming language of new, unseen documents. Preprocess the text in the same way as before, transform it with the same TfidfVectorizer object that was fit on the training set (transform(), not fit_transform()), and pass the result to the classifier's predict() method.
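
The four steps above can be sketched with a scikit-learn pipeline, which fits the vectorizer and classifier together on the training data and reuses the fitted vectorizer at prediction time. This is a minimal sketch rather than the project's final model. The helper names and the toy texts are made up for illustration, and the inputs are assumed to be cleaned already.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline


def train_language_model(train_texts, train_labels):
    """Fit a TF-IDF vectorizer and a logistic regression classifier on training data only."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return model


def predict_language(model, readme_text):
    """Predict the language of one cleaned README string."""
    # The pipeline applies the already-fitted vectorizer before the classifier.
    return model.predict([readme_text])[0]


# Toy data for illustration only, not taken from the scraped READMEs.
train_texts = [
    "pip install numpy pandas",
    "npm install react node",
    "python virtualenv pip",
    "node npm yarn javascript",
]
train_labels = ["Python", "JavaScript", "Python", "JavaScript"]
test_texts = ["pip install pandas", "npm run node"]
test_labels = ["Python", "JavaScript"]

model = train_language_model(train_texts, train_labels)
# Precision, recall, and F1 per class, plus overall accuracy, on the held-out texts.
print(classification_report(test_labels, model.predict(test_texts), zero_division=0))
print(predict_language(model, "pip install pandas"))
```

Swapping `LogisticRegression` for `RandomForestClassifier` or another estimator only changes the second pipeline step, which makes it straightforward to compare several models on the same features.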