# Recitation 3 - Integration and Visualization

In this notebook, we will use the BBC GoodFoods dataset to get some hands-on practice with data integration and visualization. The dataset was created by Arth Talati by scraping https://www.bbc.co.uk/food.



# Importing Packages

One of the reasons Python is such a useful language is the wide array of packages available.

These are some of the packages that we will be using within this notebook:
1. `pandas`: the bread and butter package that you will get to know very well throughout this course
2. `seaborn`: simple, yet aesthetic data visualization
3. `plotly`: interactive data visualization
4. `pickle`: serializing and deserializing Python object structures

Other notable packages are `numpy` (array manipulation), `matplotlib` (data visualization), and `scikit-learn` (machine learning).

# 1. Data Import

In [1]:
# Basic imports
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly as py
import plotly.graph_objs as go
import pickle

# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Run the following cell to load the dataset

In [2]:
# Loading the dataset from the google drive
file_path = r"/content/drive/MyDrive/Fall 2023/Penn_CIS 5450/Recitation/bbc_df.data"
with open(file_path, 'rb') as filehandle:
    # read the data as binary data stream
    bbc_df = pickle.load(filehandle)

Let's see what our data looks like

In [3]:
# Displaying first 5 rows of the dataset
bbc_df.head(5)

Unnamed: 0,title,url,author,rating,difficulty,serves,prep_time,cook_time,nutrition,id,cusine,post_dates,diet_types,courses,compact_ingredients,rating_of_100,collections,method,ingredients
0,"Tangy carrot, red cabbage & onion salad",https://www.bbcgoodfood.com/recipes/tangy-carr...,Good Food,1.0,Easy,4.0,900.0,,"{'kcal': '146', 'fat': '7g', 'saturates': '1g'...",95429,Asian,1257033600,"[Vegetarian, Low-salt]","[Side dish, Dinner]","[carrot, red cabbage, red onion, mint, coriand...",89.0,"[Vietnamese, Autumn salad, Grated carrot]","Tip the carrots, cabbage and onions into a bow...","[4 carrots, cut into thin sticks or grated Car..."
1,Griddled flatbreads,https://www.bbcgoodfood.com/recipes/griddled-f...,Mary Cadogan,1.0,Moreeffort,,900.0,300.0,"{'kcal': '117', 'fat': '2g', 'saturates': '0g'...",100977,British,1217545200,"[Vegetarian, Low-salt]","[Side dish, Snack]","[wholemeal flour, white flour, yeast, sugar, o...",84.0,"[Turkish, Griddled, Flatbread, Wholemeal bread]",Tip the flours into a food processor. Add the ...,"[250g strong wholemeal flour, 250g strong whit..."
2,Easy hummus recipe,https://www.bbcgoodfood.com/recipes/easy-hummu...,Sarah Cook,1.0,Easy,,300.0,,"{'kcal': '135', 'fat': '5.1g', 'saturates': '0...",101317,Middle Eastern,1325376000,"[Low-salt, Vegetarian]","[Lunch, Side dish]","[chickpea, tahini paste, garlic clove, Greek y...",86.0,"[Lunchbox, Under 200 cal, Vegetarian picnic, S...",Drain the chickpeas into a sieve set over a bo...,"[1 x 400g can chickpea, don't drain, 1 tbsp ta..."
3,Turkish lamb pilau,https://www.bbcgoodfood.com/recipes/turkish-la...,Good Food,1.0,Easy,4.0,,1800.0,"{'kcal': '584', 'fat': '24g', 'saturates': '9g...",97335,Turkish,1114902000,[Low-salt],"[Dinner, Lunch, Main course, Supper]","[pine nut, olive oil, onion, cinnamon, lamb, b...",73.0,"[Turkish, Slow cooker, Tagine, Easter leftover]",Dry-fry the pine nuts or almonds in a large pa...,"[small handful pine nuts or flaked almonds, 1 ..."
4,"Slow-roast lamb with cinnamon, fennel & citrus",https://www.bbcgoodfood.com/recipes/slow-roast...,Sarah Cook,1.0,Easy,6.0,900.0,15600.0,"{'kcal': '514', 'fat': '32g', 'saturates': '13...",101226,Middle Eastern,1296518400,[Low-salt],"[Dinner, Main course]","[lamb, lemon, olive oil, clear honey, cinnamon...",91.0,"[Dinner party main, Mother's Day, Father's Day...",Put the lamb into a large food bag with all th...,[1 leg of lamb Lamb laamA lamb is a sheep that...


# 2. Data Exploration

Before we can implement algorithms or derive useful insights from the data, we need to understand our data. So let us begin our data exploration!

The `describe()` function reports summary statistics such as mean, std, min, and max, which help us understand the distribution of our data.
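As a minimal sketch of what `describe()` returns, the example below uses a small toy frame rather than `bbc_df`; the column names mirror the dataset, but `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for bbc_df (hypothetical values, same column names)
toy_df = pd.DataFrame({
    "rating": [1.0, 0.8, 0.9, 1.0],
    "serves": [4.0, 2.0, None, 6.0],
    "difficulty": ["Easy", "Easy", "Moreeffort", "Easy"],
})

# By default, describe() summarises only the numeric columns
summary = toy_df.describe()
print(summary)

# include="all" also reports count/unique/top/freq for non-numeric columns
print(toy_df.describe(include="all"))
```

Note that the `count` row skips missing values, so it doubles as a quick check for nulls.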

In [None]:
#TODO: Get descriptive statistics and distributions for each column


We can use dtypes to get the datatype of each column in the dataframe.
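A short illustration on a toy frame (hypothetical `toy_df`, with titles and ids taken from the preview above) shows what the `dtypes` attribute looks like:

```python
import pandas as pd

# Toy stand-in for bbc_df (only a few columns)
toy_df = pd.DataFrame({
    "title": ["Easy hummus recipe", "Griddled flatbreads"],
    "serves": [4.0, 2.0],
    "id": [101317, 100977],
})

# dtypes is an attribute (no parentheses) that maps column name -> dtype
column_types = toy_df.dtypes
print(column_types)
```

Text columns typically show up as `object` (or a dedicated string dtype, depending on the pandas version), which is a hint that they may need parsing before numeric work.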

In [None]:
#TODO: Inspect the types of each column in a dataframe


The `isna()` function is used to find out whether there are null values in the dataset. This will be useful when we deal with real-world data and need to handle missing values. The following code will provide the number of null values in each column.
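A minimal sketch of counting nulls per column, assuming a toy frame (`toy_df` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for bbc_df with some missing values
toy_df = pd.DataFrame({
    "serves": [4.0, np.nan, np.nan, 6.0],
    "cook_time": [np.nan, 300.0, 1800.0, 15600.0],
    "title": ["a", "b", "c", "d"],
})

# isna() gives a True/False frame; summing counts the True values per column
null_counts = toy_df.isna().sum()
print(null_counts)
```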

In [None]:
# TODO: Find the total number of null values in each column


Now, let's plot some histograms to get a better sense of our data!

Run the following cell to plot a histogram for the column `rating` in the dataset. This will give us a visual representation of the distribution of values in `rating`.
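One possible way to draw such a histogram is the pandas plotting interface on top of matplotlib. The sketch below uses hypothetical `rating` values in a toy frame; the `bins` argument controls how many bars are drawn.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the `rating` column (hypothetical values)
toy_df = pd.DataFrame({"rating": [0.2, 0.5, 0.6, 0.8, 0.8, 0.9, 1.0, 1.0, 1.0]})

fig, ax = plt.subplots()
# bins=10 splits the value range into 10 equal-width intervals
toy_df["rating"].plot.hist(bins=10, ax=ax)
ax.set_xlabel("rating")
ax.set_title("Distribution of rating")
```

In a notebook the figure renders automatically at the end of the cell.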

In [None]:
#TODO: Plotting a histogram using the values of the column `rating`


In [None]:
#TODO: Plot a histogram for the column `serves`


In [None]:
#TODO: Plot a histogram for the column `prep_time`. Also specify the number of bins as 10


### Dealing with Lists and List comprehension

Lists are a built-in datatype for storing collections of data in Python. While for loops can be used to apply an operation to every element of a list, in this section we will explore list comprehension, which expresses the same logic more concisely and is usually somewhat faster than the equivalent for loop.

IMPORTANT: We highly encourage you to familiarize yourself with list comprehension as it is an essential tool for this course

Let's start with a few examples -
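As a small sketch (the lists here are made up for illustration), the same transformation is written below once as a comprehension and once as a for loop, followed by a nested comprehension over a 2D list:

```python
numbers = [1, 2, 3, 4, 5]

# 1D list: square each element with a list comprehension
squares = [n ** 2 for n in numbers]

# Equivalent for loop
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# 2D list: flatten a list of lists with a nested comprehension
matrix = [[1, 2], [3, 4], [5, 6]]
flat = [value for row in matrix for value in row]

print(squares)                  # [1, 4, 9, 16, 25]
print(squares == squares_loop)  # True
print(flat)                     # [1, 2, 3, 4, 5, 6]
```

In the nested form, the `for` clauses read left to right in the same order as the equivalent nested loops.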

In [None]:
# List Comprehension Example:

#1D list


# Using for loops



In [None]:
# 2D List


Back to our data! Let's inspect the values we have in our column `diet_types`


In [None]:
# Looking at the values in the column `diet_types`


We can see that each entry in this column is a list. Now, let's use list comprehension on our dataset to get the set of unique values in the column `diet_types`.
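One way to do this is sketched below on a toy Series (hypothetical values in the same shape as `diet_types`). It assumes missing rows are NaN and replaces them with a one-element list so that every row can be iterated.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bbc_df["diet_types"]: lists of tags, with one missing row
diet_types = pd.Series([
    ["Vegetarian", "Low-salt"],
    np.nan,
    ["Low-salt"],
    ["Vegan", "Vegetarian"],
])

# Replace missing entries with ["NO_TAG"] so every row is a list
cleaned = diet_types.apply(
    lambda tags: tags if isinstance(tags, list) else ["NO_TAG"]
)

# Flatten the list of lists into a single list
flat_tags = [tag for tags in cleaned for tag in tags]

# set() keeps only the unique elements
unique_tags = set(flat_tags)
print(unique_tags)
```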

In [None]:
# TODO: Replacing nan with NO_TAG


In [None]:
# TODO: Flatten the list of lists to a single list

# TODO: Set function takes the unique elements from the list


# Data Manipulation

In the following section, we will once again review some functions that will be important for Homework 1, specifically:

1. `apply()`
2. `merge()`
3. `groupby()` and `reset_index()`

### 1. Apply

Taking a look at the structure of the nutrition column, we see that it is in the form of key-value pairs

In [None]:
# TODO: Inspect a sample observation of `nutrition` column


Let's use apply() on our bbc_df to split the nutrition column into individual columns based on keys

Note: `apply()` on a DataFrame takes an `axis` parameter that specifies the axis along which the function is applied. With `axis=0` (the default) the function receives each column in turn, while with `axis=1` it receives each row in turn.

Now, we will convert the `nutrition` column into a dataframe by applying the `pd.Series` constructor to each entry of the column!
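A minimal sketch of both ideas, using a two-row toy frame whose `nutrition` values are copied from the preview above (`toy_df`, `nutrition_df`, and `labels` are illustrative names):

```python
import pandas as pd

# Toy stand-in for bbc_df: each nutrition entry is a dict of key-value pairs
toy_df = pd.DataFrame({
    "id": [95429, 100977],
    "nutrition": [
        {"kcal": "146", "fat": "7g", "saturates": "1g"},
        {"kcal": "117", "fat": "2g", "saturates": "0g"},
    ],
})

# Series.apply(pd.Series) turns each dict into a row; the keys become columns
nutrition_df = toy_df["nutrition"].apply(pd.Series)
print(nutrition_df)

# DataFrame.apply with axis=1 passes one row at a time to the function
labels = toy_df.apply(
    lambda row: f"{row['id']}: {row['nutrition']['kcal']} kcal", axis=1
)
print(labels.tolist())
```

The values are still strings at this point (for example `"7g"`), so further cleaning would be needed before doing arithmetic on them.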

In [None]:
# TODO: Make a new df only for data related to nutrition


### Seconds to minutes

In [None]:
# TODO: Cast the column `prep_time` as float


In [None]:
# TODO: Convert column to timedelta format -> %H:%M:%S


In [None]:
# TODO: Check the datatype of columns after conversion
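The conversion steps in this subsection can be sketched on a toy column. This assumes `prep_time` is recorded in seconds and, for illustration, starts out stored as strings; `toy_df` and the derived column names are hypothetical.

```python
import pandas as pd

# Toy stand-in: prep times in seconds, stored as strings
toy_df = pd.DataFrame({"prep_time": ["900.0", "300.0", "1800.0"]})

# Step 1: cast to float
toy_df["prep_time"] = toy_df["prep_time"].astype(float)

# Step 2: interpret the numbers as seconds and convert to timedelta
toy_df["prep_time_td"] = pd.to_timedelta(toy_df["prep_time"], unit="s")

# Plain minutes are just seconds divided by 60
toy_df["prep_minutes"] = toy_df["prep_time"] / 60

# Step 3: check the resulting dtypes
print(toy_df.dtypes)
print(toy_df)
```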


### 2. Merge
Merge is used to join two dataframes with a common id variable. It is also really useful to see intersections between two dataframes - think Venn Diagrams.

What would happen if we had data in separate dataframes?

Let us consider that we have two dataframes which contain the following information:
1. cuisines_df: contains the id, title and cuisine
2. ratings_df: contains the id and ratings
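To make the Venn diagram picture concrete, here is a small sketch with two toy frames that only partly overlap on `id`. The frames, ids, and the `toy_` names are hypothetical; the `cusine` spelling follows the dataset's column name.

```python
import pandas as pd

toy_cuisines_df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Easy hummus recipe", "Griddled flatbreads", "Turkish lamb pilau"],
    "cusine": ["Middle Eastern", "British", "Turkish"],
})
toy_ratings_df = pd.DataFrame({
    "id": [2, 3, 4],
    "rating_of_100": [84.0, 73.0, 91.0],
})

# inner: keep only ids present in both frames (the intersection)
inner_df = toy_cuisines_df.merge(toy_ratings_df, on="id", how="inner")

# left: keep every row of the left frame; unmatched ratings become NaN
left_df = toy_cuisines_df.merge(toy_ratings_df, on="id", how="left")

# right: keep every row of the right frame; unmatched titles become NaN
right_df = toy_cuisines_df.merge(toy_ratings_df, on="id", how="right")

# Sort the matched rows to see the top-rated dishes first
top_df = inner_df.sort_values("rating_of_100", ascending=False)
print(top_df)
```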

In [None]:
# TODO: Create cuisines_df which has the columns id, title and cusine


In [None]:
# TODO: Create ratings_df which has the columns id, rating_of_100



Which are the top dishes (rating wise) and what are their ratings?

In [None]:
# TODO: Merge the dataframes cuisines_df and ratings_df to find which dishes have the top ratings


Try setting `how='left'` and `how='right'`.

What do you notice?

In [None]:
# TODO: Implement the above merge operation by setting how='left' (call it left_cuisines_df) and how='right' (call it right_cuisines_df)


### 3. Aggregation

In order to get a broader picture of our data, we often want to aggregate values with classic summary statistics such as mean, median, and count.
This is especially important when we want to compare different groups within our data.

Now that we have a top_cuisines_df that consists of all recipe titles, cuisines and their respective ratings, an interesting metric to look at would be the average rating of each cuisine.

In order to do this in pandas, we must first call the groupby() function and specify the group we want to aggregate on (in this case cuisine) and then call the specific summary statistic function.

For example, if we wanted to find the mean of each group we would do the following:

df.groupby(group).mean()

Let's create a new df, agg_cusines_df, that will group our top_cuisines_df by the field cuisine and calculate the mean rating.

Notice our use of reset_index()
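The effect of `reset_index()` is easiest to see on a small example. The sketch below uses a hypothetical toy frame; the `cusine` and `rating_of_100` column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for top_cuisines_df
toy_top_cuisines_df = pd.DataFrame({
    "cusine": ["British", "British", "Turkish", "Asian"],
    "rating_of_100": [84.0, 90.0, 73.0, 89.0],
})

# After groupby().mean(), the group labels live in the index
grouped = toy_top_cuisines_df.groupby("cusine")["rating_of_100"].mean()
print(grouped)

# reset_index() moves the labels back into an ordinary column
agg_df = grouped.reset_index()
print(agg_df)
```

Having `cusine` as a regular column again makes the result easier to merge or pass to plotting functions.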

In [None]:
#TODO: Get the average ratings of all cuisines


Now let's try the same question using pandasql

pandasql allows you to query pandas DataFrames using SQL syntax

In [None]:
!pip install pandasql
!pip install sqlalchemy==1.4.46
import pandasql as ps #SQL on Pandas Dataframe

In [None]:
cuisines_df

In [None]:
# TODO: Write a SQL query to get the average rating of all cuisines
bbc_rating_df = bbc_df[['id', 'rating_of_100']]
cus_query = """
"""
rating_df = ps.sqldf(cus_query, locals())
rating_df

# Data Visualization

Now that we have sufficiently explored our data, let's start creating some visualizations! We will specifically be using `seaborn` and `plotly`.

- [seaborn](https://seaborn.pydata.org/) creates relatively aesthetic visualizations with minimal effort. I like using seaborn for simple yet informative graphs like barplots, boxplots, and line charts, since the syntax is very straightforward and easy to understand.
- [plotly](https://plot.ly/python/) is my favorite Python visualization tool as it creates *interactive* visualizations. There are many predefined graph templates available on the website that can be used to create complex visualizations.

Throughout this section, remember the following data visualization steps:

1. Look at your data
2. Identify the message and its components
3. Select your chart
4. Refine

In [None]:
# Necessary imports
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly as py
import plotly.graph_objs as go

## Data Visualization with Seaborn
Using seaborn, let’s answer some simple questions:
### 1. What is the most common level of difficulty within our set of recipes?
To answer this question, the most fitting graph would be a countplot. Using `bbc_df`, let’s make a countplot of the column `difficulty`.
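A minimal countplot sketch on a toy frame follows. The values are hypothetical; `"Hard"` in particular is an invented label used only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for bbc_df["difficulty"]
toy_df = pd.DataFrame(
    {"difficulty": ["Easy", "Easy", "Moreeffort", "Easy", "Hard"]}
)

# value_counts() gives the same numbers the plot will show,
# sorted from most to least common
counts = toy_df["difficulty"].value_counts()

fig, ax = plt.subplots()
# order= sorts the bars by frequency
sns.countplot(data=toy_df, x="difficulty", order=counts.index, ax=ax)
ax.set_title("Recipes per difficulty level")
```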

In [None]:
#TODO: Countplot of the column `difficulty`



### 2. How does `prep_time` vary by difficulty?

Because `prep_time` is a quantitative variable, it has numeric properties that describe its distribution (mean, median, spread, etc.). Thus, plots such as scatter plots and box plots would be great ways to look at the data.

Another effective plot would be a violin plot, which overlays the shape of the overall distribution on top of a standard boxplot.
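As a sketch of the seaborn call, assuming a toy frame with hypothetical prep times in seconds:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for bbc_df with one categorical and one numeric column
toy_df = pd.DataFrame({
    "difficulty": ["Easy"] * 4 + ["Moreeffort"] * 4,
    "prep_time": [300, 600, 900, 900, 900, 1800, 2700, 3600],
})

fig, ax = plt.subplots()
# One violin per difficulty level; the width shows where values concentrate
sns.violinplot(data=toy_df, x="difficulty", y="prep_time", ax=ax)
ax.set_title("prep_time by difficulty")
```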

Using the `bbc_df` dataframe that contains all of our data from the dataset, let's compare prep times across difficulty levels:

In [None]:
#TODO: violin plot with prep_time and difficulty


## Data Visualization with Plotly
With plotly, let’s try to create more layered and complex graphs. These can also be created in seaborn, but plotly’s biggest advantage is that you can interact with the graphs in various ways, including zooming, isolating variables, and hovering over tooltips. This means that you can afford to initially create more complicated visuals, as you will have the opportunity to interact with and adjust the graph later on.


### 1. Grouped Bar Plots
One way to “spice” up visualizations is using color to further categorize information. Let’s take advantage of this to compare prep times across the different cuisines in our recipes.

In [None]:
#TODO: Make a plotly bar plot with prep_time and cuisine


### 2. 3D Scatter Plots

plotly has a ton of interesting visualization tools, but just because they are interesting, doesn't mean that they are actually meaningful. Would you argue that the 3D scatter plot below is a helpful visual?

In [None]:
#TODO: Make a plotly 3d plot with cook_time, prep_time and cuisine


### 3. Side-by-Side Violin Plot

We can manipulate our classic violin plot into a side-by-side violin plot to compare distributions of difficulty across different cuisines. What are the pros of this visualization? Cons?

In [None]:
#TODO: Make a plotly side by side violin plot with difficulty and cuisine as main variables

### 4. Radial Plots

Radial plots, or spider plots, are another way to visualize the magnitudes of data. What are the pros of this visualization? Cons?

In [None]:
#TODO: Make a radial plot using cuisine and difficulty


### 5. Word Clouds

Word Clouds are a great way to view text data. We can use this to visualize the most frequent words within the instructions for our recipes.
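A word cloud sizes each word by how often it appears. The frequency table behind it can be sketched with the standard library alone; this is not the `wordcloud` package itself, and the texts and the small stopword set below are toy assumptions.

```python
import re
from collections import Counter

# Toy stand-in for bbc_df["method"] (short illustrative texts)
methods = [
    "Tip the flours into a food processor.",
    "Tip the onions into a bowl.",
]

# Lowercase everything and split into words
words = re.findall(r"[a-z']+", " ".join(methods).lower())

# Drop very common words that carry little meaning (tiny hypothetical list)
stopwords = {"the", "into", "a", "and"}
freqs = Counter(word for word in words if word not in stopwords)

print(freqs.most_common(3))
```

The `wordcloud` package can take such a mapping directly through `WordCloud().generate_from_frequencies(freqs)`, or do the counting itself from raw text with `generate()`.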

In [None]:
from wordcloud import WordCloud

In [None]:
#TODO: Word Cloud on 'method' of bbc_df


In [None]:
#TODO: Word Cloud on 'title' of bbc_df

#plt.subplots(figsize = (26, 12))



# Conclusion

Data Wrangling and Visualization are the bread and butter of data analysis. This notebook only covers a small slice of what can be done in Python, which itself is only one tool among hundreds available for this task.

We hope this notebook was helpful to you, and please reach out to us if you have any questions!

Thanks everyone :)