# Spotify Natural Language Processing Project




### By: Daniel Ford, Glady Barrios, Kevin Smith

----

# Project Goal

- The goal of this project is to use natural language processing and classification models to predict a GitHub repository's primary programming language from the text of its README, and to identify the terms that are most useful for that prediction.

---

# Initial Questions

 - What are the top 5 programming languages among the repos returned by a GitHub search for 'Spotify'?
 - What are the most common words in the READMEs of those Spotify repos?
 - Within each of the top 5 programming languages, what are the most common README words?
 - What are the most common bigrams in the READMEs for each of these languages?
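
The word and bigram questions above can be explored with a simple frequency count. The following is a minimal sketch using only the standard library; the `readmes` dictionary and the `top_ngrams` helper are hypothetical stand-ins for illustration, not part of the project's `prepare` module.

```python
import re
from collections import Counter

# Hypothetical, already-cleaned README snippets keyed by language (illustrative only).
readmes = {
    "Python": "spotify playlist downloader download spotify playlist",
    "JavaScript": "spotify web api client spotify web player",
}

def top_ngrams(text, n=1, k=3):
    """Return the k most common n-grams (as tuples of words) in a string."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Zipping n staggered copies of the word list yields consecutive n-word windows.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)

for language, text in readmes.items():
    print(language, top_ngrams(text, n=1, k=2), top_ngrams(text, n=2, k=2))
```

Setting `n=1` answers the single-word questions and `n=2` the bigram question; running it per language keeps the counts comparable across languages.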


### Mini-data dictionary

---
| Attribute | Definition | Data Type |
| ----- | ----- | ----- |
|repo |The repository's full name, in `owner/name` form |object |
|readme_contents |The raw text of the repository's README |object |
|language |The repository's primary programming language |object |
|lemmatized |The cleaned, lemmatized README text |object |


### Important Libraries

In [1]:
import pandas as pd
import numpy as np
import unicodedata
import re, os
import json
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from scipy.stats import zscore


import time
import random
from requests import get
from bs4 import BeautifulSoup
import acquire  # acquiring the data takes a while

import prepare

import warnings
warnings.filterwarnings("ignore")

# Acquire

In [2]:
df = acquire.github_df()

In [3]:
df.head()

| Unnamed: 0 | repo | language | readme_contents |
| ----- | ----- | ----- | ----- |
| 0 | zmb3/spotify | Go | `\nSpotify\n=======\n\n[![GoDoc](https://godoc....` |
| 1 | spotDL/spotify-downloader | Python | `<!--- mdformat-toc start --slug=github --->\n\...` |
| 2 | Spotifyd/spotifyd | Rust | `# Spotifyd <!-- omit in toc -->\n<!-- ALL-CONT...` |
| 3 | Rigellute/spotify-tui | Rust | `# Spotify TUI\n\n![Continuous Integration](htt...` |
| 4 | JohnnyCrazy/SpotifyAPI-NET | C# | `\n<h1 align="center">\n <p align="center">Spo...` |


# Prepare 

In [5]:
df = df.dropna()

In [6]:
prepare.prepare_df(df,'readme_contents', extra_words = ['also', '&#9', 'e', 'f', 'ou', 'et', 'n', '1', "'", ';', '3', 'e', 'p'])

| Unnamed: 0 | repo | language | readme_contents | clean | stemmed | lemmatized | original_length | stem_length | lem_length | original_word_count | stemmed_word_count | lemmatized_word_count |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | zmb3/spotify | other | `\nSpotify\n=======\n\n[![GoDoc](https://godoc....` | spotify godochttpsgodocorggithubcomzmb3spotify... | spotifi godochttpsgodocorggithubcomzmb3spotify... | spotify godochttpsgodocorggithubcomzmb3spotify... | 3918 | 2352 | 2640 | 503 | 276 | 276 |
| 1 | spotDL/spotify-downloader | Python | `<!--- mdformat-toc start --slug=github --->\n\...` | mdformattoc start sluggithub editing readme en... | mdformattoc start sluggithub edit readm ensur ... | mdformattoc start sluggithub editing readme en... | 4300 | 2834 | 3021 | 471 | 277 | 277 |
| 2 | Spotifyd/spotifyd | other | `# Spotifyd <!-- omit in toc -->\n<!-- ALL-CONT...` | spotifyd omit toc allcontributorsbadgestart re... | spotifyd omit toc allcontributorsbadgestart re... | spotifyd omit toc allcontributorsbadgestart re... | 2177 | 1482 | 1576 | 231 | 136 | 136 |
| 3 | Rigellute/spotify-tui | other | `# Spotify TUI\n\n![Continuous Integration](htt...` | spotify tui continuous integrationhttpsgithubc... | spotifi tui continu integrationhttpsgithubcomr... | spotify tui continuous integrationhttpsgithubc... | 59878 | 32960 | 34049 | 3857 | 2495 | 2495 |
| 4 | JohnnyCrazy/SpotifyAPI-NET | C# | `\n<h1 align="center">\n <p align="center">Spo...` | h1 aligncenter aligncenterspotifyapinetp hrefh... | h1 aligncent aligncenterspotifyapinetp hrefhtt... | h1 aligncenter aligncenterspotifyapinetp hrefh... | 2782 | 1881 | 2009 | 272 | 169 | 169 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 994 | veeraya/8tracks-to-Spotify | JavaScript | `This is a userscript, written in Javascript, t...` | userscript written javascript converts 8tracks... | userscript written javascript convert 8track p... | userscript written javascript convert 8tracks ... | 797 | 536 | 576 | 113 | 68 | 68 |
| 995 | Luki120/PerfectSpotify | other | `# PerfectSpotify\n\n![PS](https://twickd.com/i...` | perfectspotify pshttpstwickdcomimagescf5946037... | perfectspotifi pshttpstwickdcomimagescf5946037... | perfectspotify pshttpstwickdcomimagescf5946037... | 1903 | 1384 | 1480 | 307 | 199 | 199 |
| 996 | NicolasConstant/Spotify-Sleep-Mode-Stopper | C# | `# Spotify Sleep Mode Stopper\n\n## Synopsis \n...` | spotify sleep mode stopper synopsis spotify de... | spotifi sleep mode stopper synopsi spotifi des... | spotify sleep mode stopper synopsis spotify de... | 824 | 469 | 540 | 123 | 66 | 66 |
| 997 | mattiasahlsen/spotify-queue | JavaScript | `# spotify-app\n\n## deployed on\nhttps://colla...` | spotifyapp deployed httpscollabqueuecom projec... | spotifyapp deploy httpscollabqueuecom project ... | spotifyapp deployed httpscollabqueuecom projec... | 355 | 216 | 247 | 48 | 27 | 27 |
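
The `prepare` module itself is not shown here, so the following is only a rough sketch of how a `clean` column and its length and word-count columns might be produced. It assumes lowercasing, ASCII normalization, removal of special characters, and stopword removal; that would be consistent with URLs collapsing into single tokens in the output above. The function names and the small `STOPWORDS` set are hypothetical, and stemming and lemmatization are left out to keep the example dependency-free.

```python
import re
import unicodedata

# Small illustrative stopword set (assumption); the notebook imports nltk's
# stopwords corpus, which a real pipeline would more likely use.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "to", "of", "for", "this", "also"}

def basic_clean(text):
    """Lowercase, normalize to ASCII, and keep only letters, digits, apostrophes, and whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("utf-8")
    return re.sub(r"[^a-z0-9'\s]", "", text)

def remove_stopwords(text, extra_words=()):
    """Drop stopwords plus any caller-supplied extra words."""
    stop = STOPWORDS | set(extra_words)
    return " ".join(word for word in text.split() if word not in stop)

def summarize(raw, extra_words=()):
    """Return the cleaned text along with character and word counts."""
    clean = remove_stopwords(basic_clean(raw), extra_words)
    return {
        "clean": clean,
        "original_length": len(raw),
        "clean_length": len(clean),
        "original_word_count": len(raw.split()),
        "clean_word_count": len(clean.split()),
    }

row = summarize("# Spotify TUI\n\nThis is a terminal client for Spotify!")
print(row)
```

Because punctuation is deleted rather than replaced with a space, anything joined by punctuation (such as a URL) ends up as one long token, which is worth keeping in mind when reading word counts.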


In [7]:
df.language.value_counts()

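| language | count |
| ----- | ----- |
| JavaScript | 234 |
| Python | 187 |
| TypeScript | 73 |
| Shell | 35 |
| Java | 34 |
| Objective-C | 30 |
| Jupyter Notebook | 26 |
| C# | 26 |
| CSS | 26 |
| HTML | 25 |
| PHP | 23 |
| Ruby | 22 |
| Swift | 22 |
| Go | 19 |
| Rust | 17 |
| C | 15 |
| C++ | 15 |
| Kotlin | 14 |
| Dart | 13 |
| Vue | 9 |
| R | 6 |
| CoffeeScript | 6 |
| Emacs Lisp | 5 |
| Logos | 4 |
| PowerShell | 4 |
| Elixir | 3 |
| Perl | 3 |
| AppleScript | 3 |
| Batchfile | 3 |
| Jinja | 2 |
| HCL | 2 |
| Lua | 2 |
| VimL | 2 |
| OCaml | 2 |
| Scala | 2 |
| Dockerfile | 2 |
| AutoHotkey | 2 |
| QML | 2 |
| Haskell | 1 |
| Makefile | 1 |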
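
The counts have a long tail of languages with only a handful of repos, and the prepared output shows some languages relabeled as `other`. One way such a relabeling could be done is sketched below. The `bucket_rare_labels` helper and the `min_count` cutoff are hypothetical; the rule the `prepare` module actually uses is not shown in this notebook.

```python
from collections import Counter

def bucket_rare_labels(labels, min_count, other="other"):
    """Replace every label that occurs fewer than min_count times with a catch-all label."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else other for label in labels]

# Toy label list standing in for df.language (illustrative only).
languages = ["JavaScript"] * 4 + ["Python"] * 3 + ["Go", "Rust", "Haskell"]
bucketed = bucket_rare_labels(languages, min_count=3)
print(Counter(bucketed).most_common())
```

Collapsing rare classes this way gives a classifier fewer, better-populated targets, at the cost of no longer distinguishing among the bucketed languages.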


### Split the data
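
As a rough sketch of this step, the split could be stratified on the language label so that each language keeps the same share in the train and test sets. The toy data, the 75/25 ratio, and the `random_state` value below are assumptions for illustration; only `train_test_split` itself comes from the notebook's imports.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the lemmatized README text and the language labels (illustrative only).
texts = [f"readme {i}" for i in range(40)]
labels = ["JavaScript"] * 20 + ["Python"] * 12 + ["other"] * 8

# stratify=labels keeps each language's proportion equal across the two sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=123
)

print(Counter(y_train))
print(Counter(y_test))
```

Stratifying matters here because the classes are imbalanced: a purely random split could leave a small class nearly absent from the test set.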

# Exploration 

# Modeling

# Conclusion 