# Data Cleaning & Exploratory Data Analysis
### Some Potential Ideas:
1. Sentiment analysis on the `review` column
2. Use the `recipe` column with feature engineering (length of `recipe`, TF-IDF, ...) to predict `ratings`

In [363]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [318]:
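from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px

# Project-local helpers used in the cells below
from transform import initial, transform_df

# Render DataFrame.plot() calls with plotly
pd.options.plotting.backend = 'plotly'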

In [319]:
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')

In [320]:
interactions.shape

(731927, 5)

In [321]:
recipes.shape

(83782, 12)

## Merge:
1. Left merge the recipes and interactions datasets together.

2. In the merged dataset, replace all ratings of 0 with `np.nan`. (Think about why this is a reasonable step, and include your justification in your website.)

3. Find the average rating per recipe, as a Series.

4. Add this Series containing the average rating per recipe back to the recipes dataset however you’d like (e.g., by merging). Use the resulting dataset for all of your analysis. (For the purposes of Project 4, the 'review' column in the interactions dataset doesn’t have much use.)
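
The four steps above can be sketched on toy data as follows. The frames `recipes_toy` and `interactions_toy` and all of their values are invented for illustration, and reading a rating of 0 as "no rating was given" is an assumption about the data generating process that should be checked against the data dictionary. This is a minimal sketch, not necessarily what the project's own `initial` helper does.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values).
recipes_toy = pd.DataFrame({"id": [1, 2, 3], "name": ["soup", "cake", "stew"]})
interactions_toy = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "rating": [5, 0, 4],
})

# Step 1: a left merge keeps every recipe, even one with no reviews (id 3).
merged = recipes_toy.merge(
    interactions_toy, how="left", left_on="id", right_on="recipe_id"
)

# Step 2: treat 0 as a missing rating (assumption: 0 means "not rated").
merged["rating"] = merged["rating"].replace(0, np.nan)

# Step 3: average rating per recipe; mean() skips NaN values.
avg_rating = merged.groupby("id")["rating"].mean().rename("avg_rating")

# Step 4: attach the per-recipe average back onto the recipes table.
recipes_with_avg = recipes_toy.merge(
    avg_rating, how="left", left_on="id", right_index=True
)
```

Leaving the zeros in place would pull each recipe's average down, so converting them to `np.nan` first keeps the mean restricted to ratings that were actually given.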

In [322]:
step0 = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id')
# A left merge keeps every row of recipes, including recipes that have no reviews
step0.shape

(234429, 17)

In [323]:
step1 = step0.pipe(initial)

## Data Cleaning & Transformation:
**Analysis**:
1. Some columns, like `nutrition`, contain values that look like lists, but are actually strings that look like lists. You may want to turn the strings into actual lists, or create columns for every unique value in those lists.
    - For instance, per the data dictionary, each value in the 'nutrition' column contains information in the form: "[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV)]"; you could create individual columns in your dataset titled 'calories', 'total fat', etc.
2. Convert the `steps`, `ingredients`, and `tags` columns from strings to actual lists.
3. Convert `date` and `submitted` to Timestamp objects and rename them `review_date` and `recipe_date`, respectively.
4. Drop the `id` column, since it duplicates `recipe_id`.
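
A minimal sketch of these cleaning steps on a one-row toy frame is shown below. The function name `clean`, the frame `toy`, and its values are hypothetical, and the sketch assumes the list-like columns are stored as Python-literal strings that `ast.literal_eval` can parse. The project's own `transform_df` may work differently.

```python
import ast

import pandas as pd

# Hypothetical miniature of the merged table (values invented).
toy = pd.DataFrame({
    "id": [1],
    "recipe_id": [1],
    "nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]"],
    "tags": ["['course', 'desserts']"],
    "submitted": ["2008-10-27"],
    "date": ["2008-11-19"],
})

NUTRITION_COLS = [
    "calories", "total_fat", "sugar", "sodium", "protein", "sat_fat", "carbs",
]


def clean(df):
    """Apply cleaning steps 1-4 to a copy of df (illustrative only)."""
    out = df.copy()

    # 1. Parse the nutrition strings and spread them into one column each.
    parsed = out["nutrition"].apply(ast.literal_eval)
    nutrition_df = pd.DataFrame(
        parsed.tolist(), columns=NUTRITION_COLS, index=out.index
    )
    out = out.join(nutrition_df)

    # 2. Turn list-like strings into real lists (steps and ingredients
    #    would be handled the same way).
    out["tags"] = out["tags"].apply(ast.literal_eval)

    # 3. Convert the date strings to Timestamps under clearer names.
    out["recipe_date"] = pd.to_datetime(out["submitted"])
    out["review_date"] = pd.to_datetime(out["date"])

    # 4. Drop the redundant id column and the raw columns replaced above.
    return out.drop(columns=["id", "nutrition", "submitted", "date"])


cleaned = clean(toy)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is safer when parsing strings read from a file.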

**Report**:
Describe, in detail, the data cleaning steps you took and how they affected your analyses. The steps should be explained in reference to the data generating process. Show the head of your cleaned DataFrame (see Part 2: Report for instructions).

In [360]:
step2 = step1.pipe(transform_df)

In [361]:
step2.columns

Index(['name', 'minutes', 'contributor_id', 'recipe_date', 'tags', 'n_steps',
       'steps', 'description', 'ingredients', 'n_ingredients', 'user_id',
       'recipe_id', 'review_date', 'rating', 'review', 'avg_rating',
       'calories', 'total_fat', 'sugar', 'sodium', 'protein', 'sat_fat',
       'carbs'],
      dtype='object')

In [362]:
step2.head()

| | name | minutes | contributor_id | recipe_date | tags | n_steps | steps | description | ingredients | n_ingredients | ... | rating | review | avg_rating | calories | total_fat | sugar | sodium | protein | sat_fat | carbs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 brownies in the world best ever | 40 | 985201 | 2008-10-27 | [60-minutes-or-less, time-to-make, course, mai... | 10 | [heat the oven to 350f and arrange the rack in... | these are the most; chocolatey, moist, rich, d... | [bittersweet chocolate, unsalted butter, eggs,... | 9 | ... | 4.0 | These were pretty good, but took forever to ba... | 4.0 | 138.4 | 10.0 | 50.0 | 3.0 | 3.0 | 19.0 | 6.0 |
| 1 | 1 in canada chocolate chip cookies | 45 | 1848091 | 2011-04-11 | [60-minutes-or-less, time-to-make, cuisine, pr... | 12 | [pre-heat oven the 350 degrees f, in a mixing ... | this is the recipe that we use at my school ca... | [white sugar, brown sugar, salt, margarine, eg... | 11 | ... | 5.0 | Originally I was gonna cut the recipe in half ... | 5.0 | 595.1 | 46.0 | 211.0 | 22.0 | 13.0 | 51.0 | 26.0 |
| 2 | 412 broccoli casserole | 40 | 50969 | 2008-05-30 | [60-minutes-or-less, time-to-make, course, mai... | 6 | [preheat oven to 350 degrees, spray a 2 quart ... | since there are already 411 recipes for brocco... | [frozen broccoli cuts, cream of chicken soup, ... | 9 | ... | 5.0 | This was one of the best broccoli casseroles t... | 5.0 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 |
| 3 | 412 broccoli casserole | 40 | 50969 | 2008-05-30 | [60-minutes-or-less, time-to-make, course, mai... | 6 | [preheat oven to 350 degrees, spray a 2 quart ... | since there are already 411 recipes for brocco... | [frozen broccoli cuts, cream of chicken soup, ... | 9 | ... | 5.0 | I made this for my son's first birthday party ... | 5.0 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 |
| 4 | 412 broccoli casserole | 40 | 50969 | 2008-05-30 | [60-minutes-or-less, time-to-make, course, mai... | 6 | [preheat oven to 350 degrees, spray a 2 quart ... | since there are already 411 recipes for brocco... | [frozen broccoli cuts, cream of chicken soup, ... | 9 | ... | 5.0 | Loved this. Be sure to completely thaw the br... | 5.0 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 |


## Univariate Analysis:
**Analysis**:
Look at the distributions of relevant columns separately by using DataFrame operations and drawing at least two relevant plots.

**Report**:
Embed at least one plotly plot you created in your notebook that displays the distribution of a single column (see Part 2: Report for instructions). Include a 1-2 sentence explanation about your plot, making sure to describe and interpret any trends present. (Your notebook will likely have more visualizations than your website, and that’s fine. Feel free to embed more than one univariate visualization in your website if you’d like, but make sure that each embedded plot is accompanied by a description.)


## Bivariate Analysis:
**Analysis**:
Look at the statistics of pairs of columns to identify possible associations. For instance, you may create scatter plots and plot conditional distributions, or box-plots. You must plot at least two such plots in your notebook. The results of your bivariate analyses will be helpful in identifying interesting hypothesis tests!

**Report**:
Embed at least one plotly plot that displays the relationship between two columns. Include a 1-2 sentence explanation about your plot, making sure to describe and interpret any trends present. (Your notebook will likely have more visualizations than your website, and that’s fine. Feel free to embed more than one bivariate visualization in your website if you’d like, but make sure that each embedded plot is accompanied by a description.)

## Interesting Aggregates:
**Analysis**:
Choose columns to group and pivot by and examine aggregate statistics.
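
The sketch below shows one way to build a grouped table and a pivot table with pandas. The toy values are invented, and `recipe_year` is a hypothetical helper column derived from `recipe_date`; it is not part of the cleaned dataset shown earlier.

```python
import pandas as pd

# Hypothetical review-level rows (invented values).
toy = pd.DataFrame({
    "recipe_date": pd.to_datetime(
        ["2008-10-27", "2011-04-11", "2008-05-30", "2008-05-30", "2011-04-11"]
    ),
    "n_ingredients": [9, 11, 9, 9, 11],
    "rating": [4.0, 5.0, 5.0, 5.0, 3.0],
})

# Hypothetical helper column: the year in which the recipe was submitted.
toy["recipe_year"] = toy["recipe_date"].dt.year

# Grouped table: mean rating and number of reviews per year.
grouped = toy.groupby("recipe_year")["rating"].agg(["mean", "count"])

# Pivot table: mean rating by year (rows) and ingredient count (columns).
# Combinations with no reviews appear as NaN.
pivot = toy.pivot_table(
    index="recipe_year", columns="n_ingredients", values="rating", aggfunc="mean"
)
```

Reporting the count alongside the mean helps show which cells of the table rest on only a handful of reviews.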

**Report**:
Embed at least one grouped table or pivot table in your website and explain its significance.