# Module 3 Python Practice 

**Let's look at Python equivalents of the R practice activities, using the plotnine library. The ggplot code is almost the same, apart from a few syntax differences in Python.**

In [None]:
from plotnine import *
import pandas as pd

In [None]:
tips_data = pd.read_csv("/dsa/data/all_datasets/tips.txt")

In [None]:
tips_data.head()

**Activity 1:** Plot a bar chart where the height is the number of tips per day.

In [None]:
ggplot(tips_data, aes(x="day")) + geom_bar(stat="count")
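
To see what the bar heights represent, the count that `stat="count"` produces can be reproduced in pandas. This is a minimal sketch on a hypothetical miniature table (`toy` and its values are invented for illustration, not taken from the tips file):

```python
import pandas as pd

# Hypothetical stand-in for tips_data; the rows are invented for illustration.
toy = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun", "Sun"]})

# stat="count" tallies the rows in each x category; value_counts does the same.
counts = toy["day"].value_counts()
print(counts)
```

Each value in `counts` corresponds to the height of one bar.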

**Activity 2:** Plot a bar chart where the height of the bar shows the total amount of tips per day.

In [None]:
ggplot(tips_data, aes(x="day", weight="tip")) + geom_bar()
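
Mapping `weight` to `tip` makes each row contribute its tip amount, rather than 1, to the bar for its day. A rough pandas equivalent is a grouped sum, sketched here on a hypothetical table (`toy` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for tips_data; the rows are invented for illustration.
toy = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat", "Sun"],
    "tip": [2.0, 1.5, 3.0, 4.0, 5.0],
})

# Summing the weight column within each day gives the weighted bar heights.
totals = toy.groupby("day")["tip"].sum()
print(totals)
```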

**Activity 3:** Plot a bar chart where the height of the bar shows the total amount of tips per day, with each bar broken down by sex.

In [None]:
ggplot(tips_data, aes(x="day", weight="tip")) + geom_bar(aes(fill="sex"))

This is a stacked bar chart, which does not give a good comparison between the two categories because only the bottom segments share a common baseline. Bars aligned side by side are easier to compare visually, so let's change this:

In [None]:
p = ggplot(tips_data, aes(x="day", weight="tip")) + geom_bar(aes(fill="sex"), position="dodge", colour="black")
p

This is better. According to this data set, Fridays are not a good day for waiters. Now let's change the colors by adding scales.

In [None]:
p + scale_fill_manual(values=["red", "blue"])

In [None]:
p + scale_fill_brewer(type = "qual", palette="Accent") + scale_x_discrete(limits = ["Thur", "Fri", "Sat", "Sun"])
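
An alternative to passing `limits` on every plot is to fix the order in the data itself. plotnine follows the category order of a pandas categorical column on discrete scales, so the sketch below (on a hypothetical `toy` table with invented rows) shows one way to set it:

```python
import pandas as pd

# Hypothetical stand-in for tips_data; the rows are invented for illustration.
toy = pd.DataFrame({"day": ["Sun", "Thur", "Sat", "Fri", "Sun"]})

# Declare the weekday order once instead of relying on alphabetical order.
day_order = ["Thur", "Fri", "Sat", "Sun"]
toy["day"] = pd.Categorical(toy["day"], categories=day_order, ordered=True)

# Sorting now follows the declared order rather than the alphabet.
sorted_days = toy.sort_values("day")["day"].tolist()
print(sorted_days)
```

The trade-off is that this changes the column's dtype for all later cells, whereas `scale_x_discrete(limits=...)` only affects the one plot.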

**Activity 4:** Plot a scatter plot of tips vs. total bill using sex and smoker status as facets.

In [None]:
ggplot(tips_data, aes(x="total_bill", y="tip")) + geom_point() + facet_grid(["sex", "smoker"])

To add a fitted curve, we need the scikit-misc package. We can install it from within the notebook as follows:

In [None]:
!pip install scikit-misc

In [None]:
ggplot(tips_data, aes(x="total_bill", y="tip")) + geom_point() + facet_grid(["sex", "smoker"]) + \
geom_smooth(method="loess", colour="blue")

**Activity 5:** Draw a scatter plot for variables total_bill and tip using sex and smoker as facets. Map the 'day' and 'size' attributes to color and shape visual variables, respectively.

In [None]:
p = ggplot(tips_data) + geom_point(aes(x="total_bill", y="tip", color = "day", shape="size"))\
+ facet_grid(["sex", "smoker"])
p

Notice that size isn't shown in the legend. This is because it is still stored as an integer, which is treated as a continuous variable, while the shape aesthetic needs a discrete one. Convert it to a string first.

In [None]:
tips_data["size"] = tips_data["size"].astype(str)
p = ggplot(tips_data) + geom_point(aes(x="total_bill", y="tip", color = "day", shape="size"))\
+ facet_grid(["sex", "smoker"])
p
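
The effect of the conversion can be checked without plotting. This is a small sketch on a hypothetical `toy` table (values invented for illustration) that confirms the column starts as an integer type and ends up holding strings:

```python
import pandas as pd

# Hypothetical stand-in for the size column; the values are invented.
toy = pd.DataFrame({"size": [2, 3, 2, 4]})

# Before the conversion the column is numeric, so it is treated as continuous.
is_int_before = pd.api.types.is_integer_dtype(toy["size"])

# After the conversion each party size is a string label, i.e. a discrete value.
toy["size"] = toy["size"].astype(str)
values = toy["size"].tolist()
print(is_int_before, values)
```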

Now, let's see how we can draw a *box and whiskers plot* to visualize the summary statistics of a data set. We'll use the diamonds data set and plot color vs. price for each clarity category.

In [None]:
from plotnine.data import diamonds
p3 = ggplot(diamonds, aes(x="color", y="price")) + geom_boxplot() + facet_wrap(["clarity"])
p3

The thick black line is the median, the edges of the box show the 25th and 75th percentiles (the first and third quartiles), and the dots are outliers. Because there are many outliers with very high values, putting the y-axis on a log scale might be helpful.

In [None]:
p3 + scale_y_log10()
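
The summary statistics behind a box plot can be computed by hand. The sketch below uses only the standard library and a made-up list of prices (not values from the diamonds data). It assumes the common convention, also the default in `geom_boxplot`, that points more than 1.5 times the interquartile range beyond the box are drawn as outliers. The exact quartile method plotnine uses may differ slightly from `statistics.quantiles`.

```python
import statistics

# Invented prices for illustration; not taken from the diamonds data set.
prices = [300, 450, 500, 620, 700, 810, 950, 1200, 9000]

# Quartiles: the box edges (q1, q3) and the thick middle line (median).
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")

# Points beyond 1.5 * IQR from the box edges are flagged as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(q1, median, q3, outliers)
```

A single very large value is flagged as an outlier and stretches a linear axis, which is why a log scale helps here.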

This plot shows the statistics better, but it doesn't show the details of the distribution. We can use a violin plot to see the density.

In [None]:
ggplot(diamonds, aes(x="color", y="price")) + geom_violin() + facet_wrap(["clarity"]) + scale_y_log10()

Now we can see the shape of the distribution: the width of the violin at each point represents the estimated density of observations at the corresponding price.
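
The width of a violin comes from a kernel density estimate. The following is a minimal Gaussian-kernel sketch in plain Python, not plotnine's own implementation. The sample values, the `gaussian_kde` helper, and the bandwidth of 0.1 are all assumptions chosen for illustration.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Each observation contributes a small Gaussian bump centred on itself.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density

# Invented log10 prices: a cluster near 2.6 and one isolated value at 3.9.
log_prices = [2.5, 2.6, 2.6, 2.7, 3.9]
density = gaussian_kde(log_prices, bandwidth=0.1)

# The violin would be widest near the cluster and pinched in the empty gap.
for x in (2.6, 3.3, 3.9):
    print(x, density(x))
```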

In [None]:
ggplot(diamonds, aes(x="color", y="price")) + geom_violin(aes(fill="color")) + facet_wrap(["clarity"]) + scale_y_log10()