## Predicting Programming Languages
### Natural Language Processing Among GitHub Repositories
By: _AJ Martinez,        
Ben Smith,        
Nicholas Dougherty_          

In [5]:
import pandas as pd
import numpy as np
import unicodedata
import re
import nltk

# imports for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# import modules 
from prepare import * 
import acquire 
#import explore 
#import model 

# imports for NLP extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# imports for modeling
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
# note: plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay
from sklearn.metrics import classification_report, confusion_matrix, recall_score, plot_confusion_matrix

***
## Overview and Goals

The goal of this project is to determine the main programming language of a repository from the contents of its README, using natural language processing (NLP) methods. We acquired the data from GitHub's trending pages. To recreate this project, you will need the JSON file containing the data we acquired, "data1.json".

A total of 109 repositories were scraped initially. We filtered for JavaScript, HTML and Python, with other languages categorized as 'Other'. 
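
As a minimal sketch of that grouping step (the helper name `reduce_language` is hypothetical, and treating missing labels as 'Other' is an assumption rather than something stated above):

```python
# Languages kept as their own class; every other value collapses to 'Other'.
# The set mirrors the prose above; the function name is hypothetical.
KEPT_LANGUAGES = {"JavaScript", "HTML", "Python"}


def reduce_language(language):
    """Return the language if it is one of the kept classes, else 'Other'."""
    # Missing labels (None) also fall through to 'Other' here (an assumption).
    return language if language in KEPT_LANGUAGES else "Other"


labels = ["C++", "Python", "JavaScript", None, "HTML", "Go"]
reduced = [reduce_language(label) for label in labels]
print(reduced)
```

With a pandas DataFrame, the same function could be applied to the `language` column, for example `df["language"].apply(reduce_language)`, to produce a reduced-label column.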

## Findings

We found that a () model using lemmatized data performed best, with an accuracy of () on the validate data set and a final test accuracy of (). This outperformed our baseline accuracy of (). Our model was (list results and whatnot.)
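
This section does not say how the baseline was computed. One common convention for classification, assumed here purely for illustration, is to always predict the most frequent class. A small sketch with made-up labels (not the project's data) and a hypothetical helper `baseline_accuracy`:

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Made-up labels for illustration only; these are not the project's data.
toy_labels = ["Other", "Other", "Python", "JavaScript",
              "Other", "HTML", "Python", "Other"]
label, accuracy = baseline_accuracy(toy_labels)
print(label, accuracy)
```

Any model worth keeping should beat this number on the same split.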

### With More Time

We'd like to acquire more data to see if we can improve the results for distinguishing among (). Our sample size was fairly small during this project.
*** 
## Acquisition and Preparation

Data was obtained through functions that scraped repository listings on GitHub. We first explored GitHub manually in Chrome to inspect the HTML elements. The requests module then fetched the HTML of the trending pages, and BeautifulSoup parsed it so that we could collect each repository's Uniform Resource Locator (URL) endpoint and append it to the origin. We scripted the process of requesting each of those pages and obtaining the README data from them, covering over a hundred repositories.  
Here is a segment of the code used; the full code can be viewed in the acquire.py script elsewhere in our repository.

```
# imports this segment relies on
from requests import get
from bs4 import BeautifulSoup

# create an empty list to store endpoints
endpoints = []
# go to each url - trending repos daily, weekly, and monthly
for url in ['https://github.com/trending?since=daily&spoken_language_code=en',
            'https://github.com/trending?since=weekly&spoken_language_code=en',
            'https://github.com/trending?since=monthly&spoken_language_code=en']:
    # get the response
    response = get(url)
    # create the BeautifulSoup object; it builds a parse tree from the page source
    soup = BeautifulSoup(response.text, 'html.parser')
    # identify html objects containing each repository
    for repo in soup.select('.Box-row'):
        # pull out the url endpoint for that repo and append to the list
        endpoints.append(repo
                         .select_one('h1')
                         .select_one('a')
                         .attrs['href'])
```
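
The description above mentions appending each scraped endpoint to the origin. A minimal sketch of that step, assuming the hrefs are site-relative paths such as `/google/googletest`; the helper `build_repo_urls` and the sample values are hypothetical and not taken from acquire.py:

```python
from urllib.parse import urljoin

# Origin that the relative endpoints are appended to.
ORIGIN = "https://github.com"


def build_repo_urls(endpoints):
    """Turn relative hrefs such as '/google/googletest' into absolute URLs."""
    return [urljoin(ORIGIN, endpoint) for endpoint in endpoints]


# Example hrefs in the shape the trending page is assumed to return.
sample_endpoints = ["/google/googletest", "/projectdiscovery/nuclei-templates"]
repo_urls = build_repo_urls(sample_endpoints)
# Dropping the leading slash yields the "owner/name" form seen in the repo column.
repo_names = [endpoint.lstrip("/") for endpoint in sample_endpoints]
print(repo_urls)
print(repo_names)
```

Using `urljoin` rather than plain string concatenation avoids doubled or missing slashes if the href format changes slightly.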
The collected data was stored in a .json file, which we then read into our DataFrame, like so:

In [26]:
# Read in data from the JSON created through acquire
df = pd.read_json('data1.json')
# View the content of the first row 
df.head(1)

Unnamed: 0,repo,language,readme_contents
0,google/googletest,C++,"# GoogleTest\n\n### Announcements\n\n#### Live at Head\n\nGoogleTest now follows the\n[Abseil Live at Head philosophy](https://abseil.io/about/philosophy#upgrade-support).\nWe recommend\n[updating to the latest commit in the `main` branch as often as possible](https://github.com/abseil/abseil-cpp/blob/master/FAQ.md#what-is-live-at-head-and-how-do-i-do-it).\n\n#### Documentation Updates\n\nOur documentation is now live on GitHub Pages at\nhttps://google.github.io/googletest/. We recommend browsing the documentation on\nGitHub Pages rather than directly in the repository.\n\n#### Release 1.11.0\n\n[Release 1.11.0](https://github.com/google/googletest/releases/tag/release-1.11.0)\nis now available.\n\n#### Coming Soon\n\n* We are planning to take a dependency on\n [Abseil](https://github.com/abseil/abseil-cpp).\n* More documentation improvements are planned.\n\n## Welcome to **GoogleTest**, Google's C++ test framework!\n\nThis repository is a merger of the formerly separate GoogleTest and GoogleMock\nprojects. These were so closely related that it makes sense to maintain and\nrelease them together.\n\n### Getting Started\n\nSee the [GoogleTest User's Guide](https://google.github.io/googletest/) for\ndocumentation. 
We recommend starting with the\n[GoogleTest Primer](https://google.github.io/googletest/primer.html).\n\nMore information about building GoogleTest can be found at\n[googletest/README.md](googletest/README.md).\n\n## Features\n\n* An [xUnit](https://en.wikipedia.org/wiki/XUnit) test framework.\n* Test discovery.\n* A rich set of assertions.\n* User-defined assertions.\n* Death tests.\n* Fatal and non-fatal failures.\n* Value-parameterized tests.\n* Type-parameterized tests.\n* Various options for running the tests.\n* XML test report generation.\n\n## Supported Platforms\n\nGoogleTest requires a codebase and compiler compliant with the C++11 standard or\nnewer.\n\nThe GoogleTest code is officially supported on the following platforms.\nOperating systems or tools not listed below are community-supported. For\ncommunity-supported platforms, patches that do not complicate the code may be\nconsidered.\n\nIf you notice any problems on your platform, please file an issue on the\n[GoogleTest GitHub Issue Tracker](https://github.com/google/googletest/issues).\nPull requests containing fixes are welcome!\n\n### Operating Systems\n\n* Linux\n* macOS\n* Windows\n\n### Compilers\n\n* gcc 5.0+\n* clang 5.0+\n* MSVC 2015+\n\n**macOS users:** Xcode 9.3+ provides clang 5.0+.\n\n### Build Systems\n\n* [Bazel](https://bazel.build/)\n* [CMake](https://cmake.org/)\n\n**Note:** Bazel is the build system used by the team internally and in tests.\nCMake is supported on a best-effort basis and by the community.\n\n## Who Is Using GoogleTest?\n\nIn addition to many internal projects at Google, GoogleTest is also used by the\nfollowing notable projects:\n\n* The [Chromium projects](http://www.chromium.org/) (behind the Chrome browser\n and Chrome OS).\n* The [LLVM](http://llvm.org/) compiler.\n* [Protocol Buffers](https://github.com/google/protobuf), Google's data\n interchange format.\n* The [OpenCV](http://opencv.org/) computer vision library.\n\n## Related Open Source Projects\n\n[GTest 
Runner](https://github.com/nholthaus/gtest-runner) is a Qt5 based\nautomated test-runner and Graphical User Interface with powerful features for\nWindows and Linux platforms.\n\n[GoogleTest UI](https://github.com/ospector/gtest-gbar) is a test runner that\nruns your test binary, allows you to track its progress via a progress bar, and\ndisplays a list of test failures. Clicking on one shows failure text. GoogleTest\nUI is written in C#.\n\n[GTest TAP Listener](https://github.com/kinow/gtest-tap-listener) is an event\nlistener for GoogleTest that implements the\n[TAP protocol](https://en.wikipedia.org/wiki/Test_Anything_Protocol) for test\nresult output. If your test runner understands TAP, you may find it useful.\n\n[gtest-parallel](https://github.com/google/gtest-parallel) is a test runner that\nruns tests from your binary in parallel to provide significant speed-up.\n\n[GoogleTest Adapter](https://marketplace.visualstudio.com/items?itemName=DavidSchuldenfrei.gtest-adapter)\nis a VS Code extension allowing to view GoogleTest in a tree view and run/debug\nyour tests.\n\n[C++ TestMate](https://github.com/matepek/vscode-catch2-test-adapter) is a VS\nCode extension allowing to view GoogleTest in a tree view and run/debug your\ntests.\n\n[Cornichon](https://pypi.org/project/cornichon/) is a small Gherkin DSL parser\nthat generates stub code for GoogleTest.\n\n## Contributing Changes\n\nPlease read\n[`CONTRIBUTING.md`](https://github.com/google/googletest/blob/master/CONTRIBUTING.md)\nfor details on how to contribute to this project.\n\nHappy testing!\n"


From here, we break our data down into smaller components using parsing tools from the nltk library. 
- All text was converted to lowercase for normalization
- Removed:
    - accented, non-ASCII characters
    - special characters
    - stopwords
- Words were stemmed and lemmatized as well. 
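
The cleaning steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in prepare.py: the function names, the tiny stand-in stopword set, and the exact character set kept by the regular expression are all hypothetical. Stemming and lemmatization are left out so the sketch runs without downloading nltk data; in practice they could be handled with, for example, nltk's `PorterStemmer` and `WordNetLemmatizer`.

```python
import re
import unicodedata

# A tiny stand-in stopword set; nltk's English list is much longer.
STOPWORDS = {"the", "is", "a", "to", "of", "and", "in", "now"}


def basic_clean(text):
    """Lowercase, strip accents and non-ASCII characters, drop special characters."""
    text = text.lower()
    # NFKD splits accented letters into base letter + accent; the ASCII
    # round trip then discards the accents and any other non-ASCII characters.
    text = (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8"))
    # Keep only lowercase letters, digits, and whitespace (assumed character set).
    return re.sub(r"[^a-z0-9\s]", "", text)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split on whitespace and drop every token found in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)


sample = "GoogleTest is Google's C++ test framework, café-tested!"
cleaned = remove_stopwords(basic_clean(sample))
print(cleaned)
```

Running the cleaning before stopword removal matters: punctuation attached to a word would otherwise prevent it from matching the stopword set.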

All of these processes were combined into a single function:

In [27]:
# prepare the dataframe and return text stemmed, lemmatized, cleaned, tokenized, et cetera
df = prep_repos(df)
# This gives us the following. 
# Original content in one column; a clean column; another that is stemmed; lemmatized; and the languages
df.head(2)

Unnamed: 0,repo,language,original,clean,stemmed,lemmatized,language_reduced
0,google/googletest,C++,"# GoogleTest\n\n### Announcements\n\n#### Live at Head\n\nGoogleTest now follows the\n[Abseil Live at Head philosophy](https://abseil.io/about/philosophy#upgrade-support).\nWe recommend\n[updating to the latest commit in the `main` branch as often as possible](https://github.com/abseil/abseil-cpp/blob/master/FAQ.md#what-is-live-at-head-and-how-do-i-do-it).\n\n#### Documentation Updates\n\nOur documentation is now live on GitHub Pages at\nhttps://google.github.io/googletest/. We recommend browsing the documentation on\nGitHub Pages rather than directly in the repository.\n\n#### Release 1.11.0\n\n[Release 1.11.0](https://github.com/google/googletest/releases/tag/release-1.11.0)\nis now available.\n\n#### Coming Soon\n\n* We are planning to take a dependency on\n [Abseil](https://github.com/abseil/abseil-cpp).\n* More documentation improvements are planned.\n\n## Welcome to **GoogleTest**, Google's C++ test framework!\n\nThis repository is a merger of the formerly separate GoogleTest and GoogleMock\nprojects. These were so closely related that it makes sense to maintain and\nrelease them together.\n\n### Getting Started\n\nSee the [GoogleTest User's Guide](https://google.github.io/googletest/) for\ndocumentation. 
We recommend starting with the\n[GoogleTest Primer](https://google.github.io/googletest/primer.html).\n\nMore information about building GoogleTest can be found at\n[googletest/README.md](googletest/README.md).\n\n## Features\n\n* An [xUnit](https://en.wikipedia.org/wiki/XUnit) test framework.\n* Test discovery.\n* A rich set of assertions.\n* User-defined assertions.\n* Death tests.\n* Fatal and non-fatal failures.\n* Value-parameterized tests.\n* Type-parameterized tests.\n* Various options for running the tests.\n* XML test report generation.\n\n## Supported Platforms\n\nGoogleTest requires a codebase and compiler compliant with the C++11 standard or\nnewer.\n\nThe GoogleTest code is officially supported on the following platforms.\nOperating systems or tools not listed below are community-supported. For\ncommunity-supported platforms, patches that do not complicate the code may be\nconsidered.\n\nIf you notice any problems on your platform, please file an issue on the\n[GoogleTest GitHub Issue Tracker](https://github.com/google/googletest/issues).\nPull requests containing fixes are welcome!\n\n### Operating Systems\n\n* Linux\n* macOS\n* Windows\n\n### Compilers\n\n* gcc 5.0+\n* clang 5.0+\n* MSVC 2015+\n\n**macOS users:** Xcode 9.3+ provides clang 5.0+.\n\n### Build Systems\n\n* [Bazel](https://bazel.build/)\n* [CMake](https://cmake.org/)\n\n**Note:** Bazel is the build system used by the team internally and in tests.\nCMake is supported on a best-effort basis and by the community.\n\n## Who Is Using GoogleTest?\n\nIn addition to many internal projects at Google, GoogleTest is also used by the\nfollowing notable projects:\n\n* The [Chromium projects](http://www.chromium.org/) (behind the Chrome browser\n and Chrome OS).\n* The [LLVM](http://llvm.org/) compiler.\n* [Protocol Buffers](https://github.com/google/protobuf), Google's data\n interchange format.\n* The [OpenCV](http://opencv.org/) computer vision library.\n\n## Related Open Source Projects\n\n[GTest 
Runner](https://github.com/nholthaus/gtest-runner) is a Qt5 based\nautomated test-runner and Graphical User Interface with powerful features for\nWindows and Linux platforms.\n\n[GoogleTest UI](https://github.com/ospector/gtest-gbar) is a test runner that\nruns your test binary, allows you to track its progress via a progress bar, and\ndisplays a list of test failures. Clicking on one shows failure text. GoogleTest\nUI is written in C#.\n\n[GTest TAP Listener](https://github.com/kinow/gtest-tap-listener) is an event\nlistener for GoogleTest that implements the\n[TAP protocol](https://en.wikipedia.org/wiki/Test_Anything_Protocol) for test\nresult output. If your test runner understands TAP, you may find it useful.\n\n[gtest-parallel](https://github.com/google/gtest-parallel) is a test runner that\nruns tests from your binary in parallel to provide significant speed-up.\n\n[GoogleTest Adapter](https://marketplace.visualstudio.com/items?itemName=DavidSchuldenfrei.gtest-adapter)\nis a VS Code extension allowing to view GoogleTest in a tree view and run/debug\nyour tests.\n\n[C++ TestMate](https://github.com/matepek/vscode-catch2-test-adapter) is a VS\nCode extension allowing to view GoogleTest in a tree view and run/debug your\ntests.\n\n[Cornichon](https://pypi.org/project/cornichon/) is a small Gherkin DSL parser\nthat generates stub code for GoogleTest.\n\n## Contributing Changes\n\nPlease read\n[`CONTRIBUTING.md`](https://github.com/google/googletest/blob/master/CONTRIBUTING.md)\nfor details on how to contribute to this project.\n\nHappy testing!\n",googletest announcements live head googletest follows abseil live head philosophy https abseilio philosophyupgradesupport recommend updating latest commit main branch often possible https githubcom abseil abseilcpp blob master faqmdwhatisliveatheadandhowdoidoit documentation updates documentation live github pages https googlegithubio googletest recommend browsing documentation github pages rather directly repository 
release 1110 release 1110 https githubcom google googletest releases tag release1110 available coming soon planning take dependency abseil https githubcom abseil abseilcpp documentation improvements planned welcome googletest googles c test framework repository merger formerly separate googletest googlemock projects closely related makes sense maintain release together getting started see googletest users guide https googlegithubio googletest documentation recommend starting googletest primer https googlegithubio googletest primerhtml information building googletest found googletest readmemd googletest readmemd features xunit https enwikipediaorg wiki xunit test framework test discovery rich set assertions userdefined assertions death tests fatal nonfatal failures valueparameterized tests typeparameterized tests various options running tests xml test report generation supported platforms googletest requires codebase compiler compliant c11 standard newer googletest code officially supported following platforms operating systems tools listed communitysupported communitysupported platforms patches complicate code may considered notice problems platform please file issue googletest github issue tracker https githubcom google googletest issues pull requests containing fixes welcome operating systems linux macos windows compilers gcc 50 clang 50 msvc 2015 macos users xcode 93 provides clang 50 build systems bazel https bazelbuild cmake https cmakeorg note bazel build system used team internally tests cmake supported besteffort basis community using googletest addition many internal projects google googletest also used following notable projects chromium projects http wwwchromiumorg behind chrome browser chrome os llvm http llvmorg compiler protocol buffers https githubcom google protobuf googles data interchange format opencv http opencvorg computer vision library related open source projects gtest runner https githubcom nholthaus gtestrunner qt5 based automated 
testrunner graphical user interface powerful features windows linux platforms googletest ui https githubcom ospector gtestgbar test runner runs test binary allows track progress via progress bar displays list test failures clicking one shows failure text googletest ui written c gtest tap listener https githubcom kinow gtesttaplistener event listener googletest implements tap protocol https enwikipediaorg wiki testanythingprotocol test result output test runner understands tap may find useful gtestparallel https githubcom google gtestparallel test runner runs tests binary parallel provide significant speedup googletest adapter https marketplacevisualstudiocom itemsitemnamedavidschuldenfreigtestadapter vs code extension allowing view googletest tree view run debug tests c testmate https githubcom matepek vscodecatch2testadapter vs code extension allowing view googletest tree view run debug tests cornichon https pypiorg project cornichon small gherkin dsl parser generates stub code googletest contributing changes please read contributingmd https githubcom google googletest blob master contributingmd details contribute project happy testing,googletest announc live head googletest follow abseil live head philosophi http abseilio philosophyupgradesupport recommend updat latest commit main branch often possibl http githubcom abseil abseilcpp blob master faqmdwhatisliveatheadandhowdoidoit document updat document live github page http googlegithubio googletest recommend brows document github page rather directli repositori releas 1110 releas 1110 http githubcom googl googletest releas tag release1110 avail come soon plan take depend abseil http githubcom abseil abseilcpp document improv plan welcom googletest googl c test framework repositori merger formerli separ googletest googlemock project close relat make sens maintain releas togeth get start see googletest user guid http googlegithubio googletest document recommend start googletest primer http googlegithubio 
googletest primerhtml inform build googletest found googletest readmemd googletest readmemd featur xunit http enwikipediaorg wiki xunit test framework test discoveri rich set assert userdefin assert death test fatal nonfat failur valueparameter test typeparameter test variou option run test xml test report gener support platform googletest requir codebas compil compliant c11 standard newer googletest code offici support follow platform oper system tool list communitysupport communitysupport platform patch complic code may consid notic problem platform pleas file issu googletest github issu tracker http githubcom googl googletest issu pull request contain fix welcom oper system linux maco window compil gcc 50 clang 50 msvc 2015 maco user xcode 93 provid clang 50 build system bazel http bazelbuild cmake http cmakeorg note bazel build system use team intern test cmake support besteffort basi commun use googletest addit mani intern project googl googletest also use follow notabl project chromium project http wwwchromiumorg behind chrome browser chrome os llvm http llvmorg compil protocol buffer http githubcom googl protobuf googl data interchang format opencv http opencvorg comput vision librari relat open sourc project gtest runner http githubcom nholthau gtestrunn qt5 base autom testrunn graphic user interfac power featur window linux platform googletest ui http githubcom ospector gtestgbar test runner run test binari allow track progress via progress bar display list test failur click one show failur text googletest ui written c gtest tap listen http githubcom kinow gtesttaplisten event listen googletest implement tap protocol http enwikipediaorg wiki testanythingprotocol test result output test runner understand tap may find use gtestparallel http githubcom googl gtestparallel test runner run test binari parallel provid signific speedup googletest adapt http marketplacevisualstudiocom itemsitemnamedavidschuldenfreigtestadapt vs code extens allow view googletest 
tree view run debug test c testmat http githubcom matepek vscodecatch2testadapt vs code extens allow view googletest tree view run debug test cornichon http pypiorg project cornichon small gherkin dsl parser gener stub code googletest contribut chang pleas read contributingmd http githubcom googl googletest blob master contributingmd detail contribut project happi test,googletest announcement live head googletest follows abseil live head philosophy http abseilio philosophyupgradesupport recommend updating latest commit main branch often possible http githubcom abseil abseilcpp blob master faqmdwhatisliveatheadandhowdoidoit documentation update documentation live github page http googlegithubio googletest recommend browsing documentation github page rather directly repository release 1110 release 1110 http githubcom google googletest release tag release1110 available coming soon planning take dependency abseil http githubcom abseil abseilcpp documentation improvement planned welcome googletest google c test framework repository merger formerly separate googletest googlemock project closely related make sense maintain release together getting started see googletest user guide http googlegithubio googletest documentation recommend starting googletest primer http googlegithubio googletest primerhtml information building googletest found googletest readmemd googletest readmemd feature xunit http enwikipediaorg wiki xunit test framework test discovery rich set assertion userdefined assertion death test fatal nonfatal failure valueparameterized test typeparameterized test various option running test xml test report generation supported platform googletest requires codebase compiler compliant c11 standard newer googletest code officially supported following platform operating system tool listed communitysupported communitysupported platform patch complicate code may considered notice problem platform please file issue googletest github issue tracker http githubcom google 
googletest issue pull request containing fix welcome operating system linux macos window compiler gcc 50 clang 50 msvc 2015 macos user xcode 93 provides clang 50 build system bazel http bazelbuild cmake http cmakeorg note bazel build system used team internally test cmake supported besteffort basis community using googletest addition many internal project google googletest also used following notable project chromium project http wwwchromiumorg behind chrome browser chrome o llvm http llvmorg compiler protocol buffer http githubcom google protobuf google data interchange format opencv http opencvorg computer vision library related open source project gtest runner http githubcom nholthaus gtestrunner qt5 based automated testrunner graphical user interface powerful feature window linux platform googletest ui http githubcom ospector gtestgbar test runner run test binary allows track progress via progress bar display list test failure clicking one show failure text googletest ui written c gtest tap listener http githubcom kinow gtesttaplistener event listener googletest implement tap protocol http enwikipediaorg wiki testanythingprotocol test result output test runner understands tap may find useful gtestparallel http githubcom google gtestparallel test runner run test binary parallel provide significant speedup googletest adapter http marketplacevisualstudiocom itemsitemnamedavidschuldenfreigtestadapter v code extension allowing view googletest tree view run debug test c testmate http githubcom matepek vscodecatch2testadapter v code extension allowing view googletest tree view run debug test cornichon http pypiorg project cornichon small gherkin dsl parser generates stub code googletest contributing change please read contributingmd http githubcom google googletest blob master contributingmd detail contribute project happy testing,Other
1,projectdiscovery/nuclei-templates,Python,"\n\n<h1 align=""center"">\nNuclei Templates\n</h1>\n<h4 align=""center"">Community curated list of templates for the nuclei engine to find security vulnerabilities in applications.</h4>\n\n\n<p align=""center"">\n<a href=""https://github.com/projectdiscovery/nuclei-templates/issues""><img src=""https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat""></a>\n<a href=""https://github.com/projectdiscovery/nuclei-templates/releases""><img src=""https://img.shields.io/github/release/projectdiscovery/nuclei-templates""></a>\n<a href=""https://twitter.com/pdnuclei""><img src=""https://img.shields.io/twitter/follow/pdnuclei.svg?logo=twitter""></a>\n<a href=""https://discord.gg/projectdiscovery""><img src=""https://img.shields.io/discord/695645237418131507.svg?logo=discord""></a>\n</p>\n \n<p align=""center"">\n <a href=""https://nuclei.projectdiscovery.io/templating-guide/"">Documentation</a> •\n <a href=""#-contributions"">Contributions</a> •\n <a href=""#-discussion"">Discussion</a> •\n <a href=""#-community"">Community</a> •\n <a href=""https://nuclei.projectdiscovery.io/faq/templates/"">FAQs</a> •\n <a href=""https://discord.gg/projectdiscovery"">Join Discord</a>\n</p>\n\n----\n\nTemplates are the core of the [nuclei scanner](https://github.com/projectdiscovery/nuclei) which powers the actual scanning engine.\nThis repository stores and houses various templates for the scanner provided by our team, as well as contributed by the community.\nWe hope that you also contribute by sending templates via **pull requests** or [Github issues](https://github.com/projectdiscovery/nuclei-templates/issues/new?assignees=&labels=&template=submit-template.md&title=%5Bnuclei-template%5D+) to grow the list.\n\n\n## Nuclei Templates overview\n\n\nAn overview of the nuclei template project, including statistics on unique tags, author, directory, severity, and type of templates. 
The table below contains the top ten statistics for each matrix; an expanded version of this is [available here](TEMPLATES-STATS.md), and also available in [JSON](TEMPLATES-STATS.json) format for integration.\n\n<table>\n<tr>\n<td> \n\n## Nuclei Templates Top 10 statistics\n\n| TAG | COUNT | AUTHOR | COUNT | DIRECTORY | COUNT | SEVERITY | COUNT | TYPE | COUNT |\n|-----------|-------|---------------|-------|------------------|-------|----------|-------|---------|-------|\n| cve | 1156 | daffainfo | 560 | cves | 1160 | info | 1192 | http | 3187 |\n| panel | 515 | dhiyaneshdk | 421 | exposed-panels | 523 | high | 874 | file | 68 |\n| lfi | 461 | pikpikcu | 316 | vulnerabilities | 452 | medium | 662 | network | 50 |\n| xss | 367 | pdteam | 262 | technologies | 255 | critical | 414 | dns | 17 |\n| wordpress | 364 | geeknik | 179 | exposures | 204 | low | 183 | | |\n| exposure | 293 | dwisiswant0 | 168 | misconfiguration | 197 | unknown | 6 | | |\n| rce | 291 | princechaddha | 133 | workflows | 186 | | | | |\n| cve2021 | 283 | 0x_akoko | 130 | token-spray | 154 | | | | |\n| tech | 271 | gy741 | 118 | default-logins | 95 | | | | |\n| wp-plugin | 264 | pussycat0x | 116 | file | 68 | | | | |\n\n**261 directories, 3543 files**.\n\n</td>\n</tr>\n</table>\n\n📖 Documentation\n-----\n\nPlease navigate to https://nuclei.projectdiscovery.io for detailed documentation to **build** new or your own **custom** templates.\nWe have also added a set of templates to help you understand how things work.\n\n💪 Contributions\n-----\n\nNuclei-templates is powered by major contributions from the community.\n[Template contributions ](https://github.com/projectdiscovery/nuclei-templates/issues/new?assignees=&labels=&template=submit-template.md&title=%5Bnuclei-template%5D+), [Feature Requests](https://github.com/projectdiscovery/nuclei-templates/issues/new?assignees=&labels=&template=feature_request.md&title=%5BFeature%5D+) and [Bug 
Reports](https://github.com/projectdiscovery/nuclei-templates/issues/new?assignees=&labels=&template=bug_report.md&title=%5BBug%5D+) are more than welcome.\n\n![Alt](https://repobeats.axiom.co/api/embed/55ee65543bb9a0f9c797626c4e66d472a517d17c.svg ""Repobeats analytics image"")\n\n💬 Discussion\n-----\n\nHave questions / doubts / ideas to discuss?\nFeel free to open a discussion on [Github discussions](https://github.com/projectdiscovery/nuclei-templates/discussions) board.\n\n👨‍💻 Community\n-----\n\nYou are welcome to join the active [Discord Community](https://discord.gg/projectdiscovery) to discuss directly with project maintainers and share things with others around security and automation.\nAdditionally, you may follow us on [Twitter](https://twitter.com/pdnuclei) to be updated on all the things about Nuclei.\n\n\n<p align=""center"">\n<a href=""https://github.com/projectdiscovery/nuclei-templates/graphs/contributors"">\n <img src=""https://contrib.rocks/image?repo=projectdiscovery/nuclei-templates&max=300"">\n</a>\n</p>\n\n\nThanks again for your contribution and keeping this community vibrant. 
:heart:\n",h1 aligncenter nuclei templates h1 h4 aligncentercommunity curated list templates nuclei engine find security vulnerabilities applications h4 p aligncenter hrefhttps githubcom projectdiscovery nucleitemplates issuesimg srchttps imgshieldsio badge contributionswelcomebrightgreensvgstyleflat hrefhttps githubcom projectdiscovery nucleitemplates releasesimg srchttps imgshieldsio github release projectdiscovery nucleitemplates hrefhttps twittercom pdnucleiimg srchttps imgshieldsio twitter follow pdnucleisvglogotwitter hrefhttps discordgg projectdiscoveryimg srchttps imgshieldsio discord 695645237418131507svglogodiscord p p aligncenter hrefhttps nucleiprojectdiscoveryio templatingguide documentation hrefcontributionscontributions hrefdiscussiondiscussion hrefcommunitycommunity hrefhttps nucleiprojectdiscoveryio faq templates faqs hrefhttps discordgg projectdiscoveryjoin discord p templates core nuclei scanner https githubcom projectdiscovery nuclei powers actual scanning engine repository stores houses various templates scanner provided team well contributed community hope also contribute sending templates via pull requests github issues https githubcom projectdiscovery nucleitemplates issues newassigneeslabelstemplatesubmittemplatemdtitle5bnucleitemplate5d grow list nuclei templates overview overview nuclei template project including statistics unique tags author directory severity type templates table contains top ten statistics matrix expanded version available templatesstatsmd also available json templatesstatsjson format integration table tr td nuclei templates top 10 statistics tag count author count directory count severity count type count cve 1156 daffainfo 560 cves 1160 info 1192 http 3187 panel 515 dhiyaneshdk 421 exposedpanels 523 high 874 file 68 lfi 461 pikpikcu 316 vulnerabilities 452 medium 662 network 50 xss 367 pdteam 262 technologies 255 critical 414 dns 17 wordpress 364 geeknik 179 exposures 204 low 183 exposure 293 dwisiswant0 168 
misconfiguration 197 unknown 6 rce 291 princechaddha 133 workflows 186 cve2021 283 0xakoko 130 tokenspray 154 tech 271 gy741 118 defaultlogins 95 wpplugin 264 pussycat0x 116 file 68 261 directories 3543 files td tr table documentation please navigate https nucleiprojectdiscoveryio detailed documentation build new custom templates also added set templates help understand things work contributions nucleitemplates powered major contributions community template contributions https githubcom projectdiscovery nucleitemplates issues newassigneeslabelstemplatesubmittemplatemdtitle5bnucleitemplate5d feature requests https githubcom projectdiscovery nucleitemplates issues newassigneeslabelstemplatefeaturerequestmdtitle5bfeature5d bug reports https githubcom projectdiscovery nucleitemplates issues newassigneeslabelstemplatebugreportmdtitle5bbug5d welcome alt https repobeatsaxiomco api embed 55ee65543bb9a0f9c797626c4e66d472a517d17csvg repobeats analytics image discussion questions doubts ideas discuss feel free open discussion github discussions https githubcom projectdiscovery nucleitemplates discussions board community welcome join active discord community https discordgg projectdiscovery discuss directly project maintainers share things others around security automation additionally may follow us twitter https twittercom pdnuclei updated things nuclei p aligncenter hrefhttps githubcom projectdiscovery nucleitemplates graphs contributors img srchttps contribrocks imagerepoprojectdiscovery nucleitemplatesmax300 p thanks contribution keeping community vibrant heart,h1 aligncent nuclei templat h1 h4 aligncentercommun curat list templat nuclei engin find secur vulner applic h4 p aligncent hrefhttp githubcom projectdiscoveri nucleitempl issuesimg srchttp imgshieldsio badg contributionswelcomebrightgreensvgstyleflat hrefhttp githubcom projectdiscoveri nucleitempl releasesimg srchttp imgshieldsio github releas projectdiscoveri nucleitempl hrefhttp twittercom pdnucleiimg srchttp 
imgshieldsio twitter follow pdnucleisvglogotwitt hrefhttp discordgg projectdiscoveryimg srchttp imgshieldsio discord 695645237418131507svglogodiscord p p aligncent hrefhttp nucleiprojectdiscoveryio templatingguid document hrefcontributionscontribut hrefdiscussiondiscuss hrefcommunitycommun hrefhttp nucleiprojectdiscoveryio faq templat faq hrefhttp discordgg projectdiscoveryjoin discord p templat core nuclei scanner http githubcom projectdiscoveri nuclei power actual scan engin repositori store hous variou templat scanner provid team well contribut commun hope also contribut send templat via pull request github issu http githubcom projectdiscoveri nucleitempl issu newassigneeslabelstemplatesubmittemplatemdtitle5bnucleitemplate5d grow list nuclei templat overview overview nuclei templat project includ statist uniqu tag author directori sever type templat tabl contain top ten statist matrix expand version avail templatesstatsmd also avail json templatesstatsjson format integr tabl tr td nuclei templat top 10 statist tag count author count directori count sever count type count cve 1156 daffainfo 560 cve 1160 info 1192 http 3187 panel 515 dhiyaneshdk 421 exposedpanel 523 high 874 file 68 lfi 461 pikpikcu 316 vulner 452 medium 662 network 50 xss 367 pdteam 262 technolog 255 critic 414 dn 17 wordpress 364 geeknik 179 exposur 204 low 183 exposur 293 dwisiswant0 168 misconfigur 197 unknown 6 rce 291 princechaddha 133 workflow 186 cve2021 283 0xakoko 130 tokenspray 154 tech 271 gy741 118 defaultlogin 95 wpplugin 264 pussycat0x 116 file 68 261 directori 3543 file td tr tabl document pleas navig http nucleiprojectdiscoveryio detail document build new custom templat also ad set templat help understand thing work contribut nucleitempl power major contribut commun templat contribut http githubcom projectdiscoveri nucleitempl issu newassigneeslabelstemplatesubmittemplatemdtitle5bnucleitemplate5d featur request http githubcom projectdiscoveri nucleitempl issu 
newassigneeslabelstemplatefeaturerequestmdtitle5bfeature5d bug report http githubcom projectdiscoveri nucleitempl issu newassigneeslabelstemplatebugreportmdtitle5bbug5d welcom alt http repobeatsaxiomco api emb 55ee65543bb9a0f9c797626c4e66d472a517d17csvg repobeat analyt imag discuss question doubt idea discuss feel free open discuss github discuss http githubcom projectdiscoveri nucleitempl discuss board commun welcom join activ discord commun http discordgg projectdiscoveri discuss directli project maintain share thing other around secur autom addit may follow us twitter http twittercom pdnuclei updat thing nuclei p aligncent hrefhttp githubcom projectdiscoveri nucleitempl graph contributor img srchttp contribrock imagerepoprojectdiscoveri nucleitemplatesmax300 p thank contribut keep commun vibrant heart,h1 aligncenter nucleus template h1 h4 aligncentercommunity curated list template nucleus engine find security vulnerability application h4 p aligncenter hrefhttps githubcom projectdiscovery nucleitemplates issuesimg srchttps imgshieldsio badge contributionswelcomebrightgreensvgstyleflat hrefhttps githubcom projectdiscovery nucleitemplates releasesimg srchttps imgshieldsio github release projectdiscovery nucleitemplates hrefhttps twittercom pdnucleiimg srchttps imgshieldsio twitter follow pdnucleisvglogotwitter hrefhttps discordgg projectdiscoveryimg srchttps imgshieldsio discord 695645237418131507svglogodiscord p p aligncenter hrefhttps nucleiprojectdiscoveryio templatingguide documentation hrefcontributionscontributions hrefdiscussiondiscussion hrefcommunitycommunity hrefhttps nucleiprojectdiscoveryio faq template faq hrefhttps discordgg projectdiscoveryjoin discord p template core nucleus scanner http githubcom projectdiscovery nucleus power actual scanning engine repository store house various template scanner provided team well contributed community hope also contribute sending template via pull request github issue http githubcom projectdiscovery 
nucleitemplates issue newassigneeslabelstemplatesubmittemplatemdtitle5bnucleitemplate5d grow list nucleus template overview overview nucleus template project including statistic unique tag author directory severity type template table contains top ten statistic matrix expanded version available templatesstatsmd also available json templatesstatsjson format integration table tr td nucleus template top 10 statistic tag count author count directory count severity count type count cve 1156 daffainfo 560 cf 1160 info 1192 http 3187 panel 515 dhiyaneshdk 421 exposedpanels 523 high 874 file 68 lfi 461 pikpikcu 316 vulnerability 452 medium 662 network 50 x 367 pdteam 262 technology 255 critical 414 dns 17 wordpress 364 geeknik 179 exposure 204 low 183 exposure 293 dwisiswant0 168 misconfiguration 197 unknown 6 rce 291 princechaddha 133 workflow 186 cve2021 283 0xakoko 130 tokenspray 154 tech 271 gy741 118 defaultlogins 95 wpplugin 264 pussycat0x 116 file 68 261 directory 3543 file td tr table documentation please navigate http nucleiprojectdiscoveryio detailed documentation build new custom template also added set template help understand thing work contribution nucleitemplates powered major contribution community template contribution http githubcom projectdiscovery nucleitemplates issue newassigneeslabelstemplatesubmittemplatemdtitle5bnucleitemplate5d feature request http githubcom projectdiscovery nucleitemplates issue newassigneeslabelstemplatefeaturerequestmdtitle5bfeature5d bug report http githubcom projectdiscovery nucleitemplates issue newassigneeslabelstemplatebugreportmdtitle5bbug5d welcome alt http repobeatsaxiomco api embed 55ee65543bb9a0f9c797626c4e66d472a517d17csvg repobeats analytics image discussion question doubt idea discus feel free open discussion github discussion http githubcom projectdiscovery nucleitemplates discussion board community welcome join active discord community http discordgg projectdiscovery discus directly project maintainer share thing 
others around security automation additionally may follow u twitter http twittercom pdnuclei updated thing nucleus p aligncenter hrefhttps githubcom projectdiscovery nucleitemplates graph contributor img srchttps contribrocks imagerepoprojectdiscovery nucleitemplatesmax300 p thanks contribution keeping community vibrant heart,Python


As mentioned in the overview, we bucketed every language other than JavaScript, HTML, and Python as 'Other':
```
df['language_reduced'] = df.language.apply(lambda lang: lang if lang in ['JavaScript', 'HTML', 'Python'] else 'Other')
```
The resulting class proportions (normalize=True returns fractions, not raw counts) are as follows:

In [10]:
df.language_reduced.value_counts(normalize=True)

Other         0.660550
JavaScript    0.128440
Python        0.119266
HTML          0.091743
Name: language_reduced, dtype: float64
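
These proportions also suggest a natural baseline: a model that always predicts the most common class, 'Other', would be correct about 66% of the time on this data. The sketch below shows how such a baseline could be computed. The toy series is invented for illustration and is not the project's data, and the variable names (`languages`, `keep`, `baseline_accuracy`) are hypothetical.

```python
import pandas as pd

# Invented labels standing in for df.language (not the project's data)
languages = pd.Series(
    ['JavaScript', 'Go', 'Python', 'Rust', 'HTML', 'C++', 'Go', 'Java']
)

# Same bucketing rule as above: keep three languages, everything else is 'Other'
keep = ['JavaScript', 'HTML', 'Python']
language_reduced = languages.apply(lambda lang: lang if lang in keep else 'Other')

# Share of each class; the largest share is the accuracy of a
# "always predict the most common class" baseline
proportions = language_reduced.value_counts(normalize=True)
baseline_class = proportions.idxmax()
baseline_accuracy = proportions.max()

print(baseline_class, baseline_accuracy)
```

Any trained model should beat this number to be considered useful.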

During preparation, we discovered that some repositories had their language listed as None and that some rows had nothing but an empty string ('') as their body. We removed both kinds of rows using:
```
df['original'] = df['original'].apply(lambda text: np.nan if text == '' else text)
df = df.dropna()
```
The empty-string replacement runs in the following cell:

In [28]:
df['original'] = df['original'].apply(lambda text: np.nan if text == '' else text)
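
A self-contained sketch of this cleanup on a toy frame follows. The column names mirror the ones used above, but the rows are invented, and restricting `dropna` to specific columns with `subset` is an assumption made for illustration rather than the notebook's exact call.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; only the column names come from the project
toy = pd.DataFrame({
    'language': ['Python', None, 'HTML', 'JavaScript'],
    'original': ['# readme text', 'some text', '', 'more text'],
})

# Turn empty bodies into NaN so that dropna can catch them
toy['original'] = toy['original'].apply(lambda text: np.nan if text == '' else text)

# Drop rows missing either a language or a body
cleaned = toy.dropna(subset=['language', 'original']).reset_index(drop=True)

print(cleaned)
```

Converting empty strings to NaN first lets a single `dropna` call handle both the missing languages and the empty bodies.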

## Exploratory Data Analysis

## Modeling