# Michelin NLP Capstone Project 
### Presented by: Yuvia Cardenas, Justin Evans, Cristina Lucin, and Woody Sims

## Project Overview

This project builds a model that predicts a project's programming language from the text of its GitHub README file. Our goal is to develop candidate models using Python and its libraries and select the most effective one for production. We first use BeautifulSoup to acquire the data, selecting 1,000 GitHub repositories tagged 'Minecraft' and collecting the README text and language information from each repo. After gathering the data, we explore it through questions and visualizations before developing a model that can tell us: "What language is this repository most likely to be written in?"

## Goals

### Create deliverables:
- README
- Final Report (Jupyter Notebook)
- Functional acquire.py, explore.py, and model.py files
- Acquire data from the Michelin website, using BeautifulSoup to scrape restaurant review text
- Prepare and split the data
- Explore the data and produce visualizations encapsulating exploration
- Establish and document baseline
- Fit and train a classification model to predict the programming language of the Repo
- Evaluate each model by comparing performance on the train set, using accuracy as the metric
- Evaluate the selected model on test data
- Develop and document findings, takeaways, recommendations, and next steps
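
The baseline and accuracy steps in the list above can be sketched as follows. This is a minimal illustration and not the project's `model.py`: the function names `baseline_prediction` and `accuracy` and the placeholder labels are assumptions made for the example. The baseline here predicts the most common class in train for every observation.

```python
from collections import Counter


def baseline_prediction(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Placeholder labels, used only for illustration
train_labels = ["class_a", "class_a", "class_b", "class_a", "class_c"]

baseline = baseline_prediction(train_labels)
baseline_accuracy = accuracy(train_labels, [baseline] * len(train_labels))
print(f"Baseline: {baseline}, accuracy on train: {baseline_accuracy:.2f}")
```

Any fitted classifier would then need to beat this baseline accuracy to be worth keeping.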

```python
# Imports
import pandas as pd
import numpy as np
import re
import os

# Webscraping/NLP
import requests
from requests import get
from bs4 import BeautifulSoup
import time
import nltk
import unicodedata
from nltk.corpus import stopwords

# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# Stats
import scipy.stats as stats
from scipy.stats import ttest_ind, levene, f_oneway

# Team Imports
import prepare as p
import acquire as a
import explore as e
import model as m
from importlib import reload
```


# Acquire
- 6,780 Michelin website page URLs were acquired from a Kaggle dataset of current Michelin restaurants worldwide, which is updated quarterly (Kaggle data acquired 1/18/2023)
- These pages were scraped with BeautifulSoup using a function called "get_michelin_pages"
- The text from the restaurant review was appended to the original Michelin Dataframe as a column titled "data"
- This dataframe included 6,780 rows before cleaning
- Each row represents a unique restaurant that currently holds a Michelin Guide award designation
- Each column represents a feature of the restaurant, such as name, location, cuisine type, and price level
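
The scraping step described above can be sketched roughly as below. The source names `get_michelin_pages` but does not show its signature, so everything here is an assumption for illustration: the helper names `extract_review_text` and `add_review_column`, the CSS selector `div.review-text`, and the sample HTML are all hypothetical. Parsing is kept separate from fetching so the sketch can be tried without a live request.

```python
from bs4 import BeautifulSoup

# Hypothetical selector: the real Michelin page markup may differ
REVIEW_SELECTOR = "div.review-text"


def extract_review_text(html, selector=REVIEW_SELECTOR):
    """Return the review text found in one page of HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        return None
    return node.get_text(" ", strip=True)


def add_review_column(rows, fetch_html):
    """Attach scraped review text to each row under the key "data".

    `rows` is a list of dicts with a "url" key; `fetch_html` is any callable
    that takes a URL and returns that page's HTML as a string.
    """
    return [
        {**row, "data": extract_review_text(fetch_html(row["url"]))}
        for row in rows
    ]


# Illustration with canned HTML instead of live requests
sample_pages = {
    "https://example.com/restaurant-1": (
        "<html><body><div class='review-text'>"
        "<p>Seasonal tasting menu.</p><p>Warm service.</p>"
        "</div></body></html>"
    ),
    "https://example.com/restaurant-2": "<html><body><p>Page not found</p></body></html>",
}
rows = [
    {"name": "Restaurant 1", "url": "https://example.com/restaurant-1"},
    {"name": "Restaurant 2", "url": "https://example.com/restaurant-2"},
]

scraped = add_review_column(rows, sample_pages.get)
```

In a real run, `fetch_html` could be a thin wrapper around `requests.get(url).text` with a short `time.sleep` between calls. Pages that yield no review text come back as `None`, which makes them easy to identify afterwards.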

# Prepare
### Prepare Actions:
- Removed columns not necessary for project goals
- Dropped rows for restaurants that no longer appear on the Michelin Guide website (no longer current Michelin awardees)
- Recast columns to more appropriate data types
- Checked for null values in the data, imputed null values where applicable
- Used regex and string methods to clean the restaurant review text
- 
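
The regex and string-method cleaning listed above might look something like the sketch below. It is not the project's `prepare.py`: the function name `clean_review_text`, the specific cleaning rules, and the sample sentence are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_review_text(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop anything outside ASCII
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("utf-8")
    text = text.lower()
    # Replace everything except letters, digits, apostrophes, and whitespace
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


sample = "  Chef's crème brûlée is EXCELLENT!\nA must-try.  "
cleaned = clean_review_text(sample)
print(cleaned)  # chef's creme brulee is excellent a must try
```

Stripping accents is a design choice that simplifies tokenization but loses some information in dish and restaurant names, so it may not suit every analysis.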