![DSL_logo](https://github.com/BrockDSL/Intro_to_Python_Workshop/blob/master/dsl_logo.png?raw=1) 


# Data Science with Python!

Welcome to the Digital Scholarship Lab Level 3 Python workshop. 

In Python 2.0, we learned about the pandas library.

What we'll learn today:
- Plotting data with Matplotlib
- Creating correlational graphs with Seaborn

We'll be using Python as a data analysis tool.

# Another Library: Matplotlib

Let's take a look at graphing our results. We can use the `matplotlib` library to generate some graphs of our results. We'll pass the data for each graph as a list or another list-like sequence, such as a pandas column.

First, import the pandas library.

In [None]:
#Load the pandas library, which works with tabular data
import pandas as pd

In [None]:
#This line is for Jupyter's benefit
%matplotlib inline
#Import Matplotlib to graph some results
import matplotlib.pyplot as plt

Let's reload our data into a new dataframe

In [None]:
#Load the file
graph_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

#Tell it what our columns are
graph_data.columns = ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality"]
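
The `sep=';'` argument matters because this file separates its values with semicolons rather than commas. The sketch below illustrates the difference using a small in-memory sample in place of the download; the rows and the names `sample_text`, `sample_df`, and `wrong_df` are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Made-up rows in a semicolon-separated layout (illustrative only)
sample_text = "fixed acidity;pH;quality\n7.0;3.5;5\n8.0;3.25;6\n"

# With sep=';' each value lands in its own column
sample_df = pd.read_csv(io.StringIO(sample_text), sep=';')
print(sample_df.columns.tolist())
print(sample_df.shape)

# Without sep=';' pandas looks for commas and finds a single column
wrong_df = pd.read_csv(io.StringIO(sample_text))
print(wrong_df.shape)
```

If a freshly loaded dataframe shows only one column, a wrong separator is a likely cause.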


## Pie Graphs
Let's draw a pie graph comparing the number of wine samples with a quality rating below 5, exactly 5, and above 5.

In [None]:
#All of the Poor Quality Wine
Total_LessThan5 = graph_data[graph_data["quality"] < 5]["quality"].count()
print("Poor Quality Wine: " + str(Total_LessThan5))

#All the Good Quality Wine
Total_GreaterThan5 = graph_data[graph_data['quality'] > 5]["quality"].count()
print("Good Quality Wine: "+ str(Total_GreaterThan5))

#All the wine with a wine quality of 5
Total_EqualTo5 = graph_data[graph_data["quality"] == 5]["quality"].count()
print("Medium Quality Wine: "+ str(Total_EqualTo5))

# Matplotlib takes the slice sizes as a list, so we'll make one
pie_data = [Total_LessThan5,Total_EqualTo5,Total_GreaterThan5]
pie_labels = ["LessThan5", "TotalEqualTo5", "GreaterThan5"]
plt.pie(pie_data,labels=pie_labels, colors=('red','black','grey'))

plt.show()
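
A comparison on a column produces `True`/`False` values, and summing them counts the `True` entries, which is a shorter route to the same totals. The sketch below uses a small hypothetical series of ratings rather than the wine data, and adds the `autopct` argument so that each slice is labelled with its percentage.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical quality ratings, for illustration only
ratings = pd.Series([3, 4, 5, 5, 5, 6, 6, 7, 8, 5])

# Summing a boolean mask counts the rows where the condition holds
below_5 = int((ratings < 5).sum())
equal_5 = int((ratings == 5).sum())
above_5 = int((ratings > 5).sum())
print(below_5, equal_5, above_5)  # 2 4 4

counts = [below_5, equal_5, above_5]
labels = ["Less than 5", "Equal to 5", "Greater than 5"]

# autopct formats each slice's share of the total as a percentage
plt.pie(counts, labels=labels, autopct="%1.1f%%")
plt.show()
```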

### How to create a donut chart

In [None]:
#All of the Poor Quality Wine
Total_LessThan5 = graph_data[graph_data["quality"] < 5]["quality"].count()
print("Poor Quality Wine: " + str(Total_LessThan5))

#All the Good Quality Wine
Total_GreaterThan5 = graph_data[graph_data['quality'] > 5]["quality"].count()
print("Good Quality Wine: "+ str(Total_GreaterThan5))

#All the wine with a wine quality of 5
Total_EqualTo5 = graph_data[graph_data["quality"] == 5]["quality"].count()
print("Medium Quality Wine: "+ str(Total_EqualTo5))

# Matplotlib takes the slice sizes as a list, so we'll make one
pie_data = [Total_LessThan5,Total_EqualTo5,Total_GreaterThan5]
pie_labels = ["LessThan5", "TotalEqualTo5", "GreaterThan5"]
plt.pie(pie_data,labels=pie_labels)

# Add a circle to create a hole in the pie chart
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)


plt.show()
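
An alternative to covering the centre with a white circle is to give the wedges a width smaller than the pie's radius through the `wedgeprops` argument. This is a minimal sketch with hypothetical counts, not the wine totals.

```python
import matplotlib.pyplot as plt

# Hypothetical counts, for illustration only
donut_data = [20, 50, 30]
donut_labels = ["Less than 5", "Equal to 5", "Greater than 5"]

fig, ax = plt.subplots()
# width=0.3 keeps only the outer ring of each wedge, leaving a hole in the middle
ax.pie(donut_data, labels=donut_labels, wedgeprops={"width": 0.3})

plt.show()
```

With this approach the hole is truly empty, so it shows whatever background colour the figure has.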

Try questions Q1 and Q2 and type "Completed" in the chat when you're done.

- **Q1** Can you create a pie graph that shows the citric acid distribution in the data? You just need to modify lines 2 and 6.

In [None]:
#Fill in the following
CitricAcidEqualto0 = graph_data[graph_data["ChangeMe"] == 0]["ChangeMe"].count()
print("Citric Acid Equal To 0: "+ str(CitricAcidEqualto0))

#Fill in the following
CitricAcidGreaterthan0 = graph_data[graph_data["ChangeMe"] > 0 ]["ChangeMe"].count()
print("Citric Acid Greater than 0: "+ str(CitricAcidGreaterthan0))

pie_data = [CitricAcidEqualto0,CitricAcidGreaterthan0]
pie_labels = ["CitricAcidEqualto0","CitricAcidGreaterThan0"]
plt.pie(pie_data,labels=pie_labels)

plt.show()

- **Q2** Can you create a donut chart that shows how many red wine samples in the dataset have a fixed acidity level over 10, under 10, and equal to 10? You just need to modify lines 2, 6, and 9, as well as copy and paste the code for the circle in the middle.

In [None]:
#Fill in the following
over_10 = graph_data[graph_data["ChangeMe"] > 10]["ChangeMe"].count() 
print("Fixed Acidity greater than 10: "+ str(over_10))

#Fill in the following
under_10 = graph_data[graph_data["ChangeMe"] < 10]["ChangeMe"].count()
print("Fixed Acidity less than 10: "+ str(under_10))

equalto_10 = graph_data[graph_data["ChangeMe"] == 10]["ChangeMe"].count()
print("Fixed Acidity equal to 10: "+ str(equalto_10))

pie_data = [over_10, under_10, equalto_10]
pie_labels = ["Over 10","Under 10","equal to 10"]
plt.pie(pie_data,labels=pie_labels, colors = ("red", "pink", "blue"))
 
#Copy and paste the code that creates the circle in the middle



plt.show()

## Automatic Histograms


Say we wanted to plot the alcohol distribution of our data set as a [histogram](https://en.wikipedia.org/wiki/Histogram).

In [None]:
# bins is the number of containers we'll split our x-axis values into
bins = 6

plt.hist(graph_data["alcohol"],bins, color=('red'), alpha=(0.9), hatch="x", edgecolor='white')

plt.title("Alcohol distribution", color=(0.2,0.6,0.4,0.6), size=30)
plt.xlabel("Alcohol Amount", size=20)
plt.ylabel("Occurrences", size=20)

#Set Background colour
plt.gca().set_facecolor('lightblue')
plt.gca().set_axis_on()

#Change the color of the x and y values
ax = plt.gca()
ax.tick_params(axis='x', colors='brown')
ax.tick_params(axis='y', colors='green')

plt.show()
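
To make the role of `bins` concrete, the following sketch sorts values into equal-width bins by hand, which is roughly what a histogram does when `bins` is an integer. The helper `histogram_counts` and the sample values are hypothetical, written only for illustration.

```python
def histogram_counts(values, bins):
    """Count values into `bins` equal-width intervals between min and max.

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    edges = [low + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for value in values:
        # The largest value belongs to the last bin rather than a new one
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts, edges

# Hypothetical alcohol percentages, for illustration only
sample = [9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 13.0, 15.0]
counts, edges = histogram_counts(sample, bins=3)
print(edges)   # [9.0, 11.0, 13.0, 15.0]
print(counts)  # [4, 2, 2]
```

Fewer bins give a coarser picture and more bins a finer one, so it is worth trying a few values.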

Try Q3 below and type "All done!" in the chat when you're done.

**Q3** Can you draw a histogram of the `pH` distribution? Make sure to give the axes good descriptions. You just need to modify lines 1, 5, and 6. (The example above should help you.)

In [None]:
bins = #FILL

plt.hist(graph_data["pH"],bins) 
plt.title("pH Distribution")
plt.xlabel("change me") #FILL
plt.ylabel("change me") #FILL

plt.show()

### How to create a Venn Diagram

First, install the `matplotlib-venn` package, which adds Venn diagrams to Matplotlib

In [None]:
%pip install matplotlib-venn

Next, import `venn2` from `matplotlib_venn`, and enter the values and names

In [None]:
from matplotlib_venn import venn2
 
# depict venn diagram
v = venn2(subsets = (7,50,10), set_colors = ("green", "red"), set_labels = ('MatPlotLib', 'Seaborn'))

# Optional Step that sets the title, font, and size
plt.title("Similarities and Differences between MatPlotLib and Seaborn", color=(0.2,0.1,0.4,0.6), size=20)

# Optional step that allows you to add texture to a circle
p = v.get_patch_by_id('11')
p.set_hatch('//')
p.set_edgecolor('black')

p = v.get_patch_by_id('10')
p.set_hatch('*')
p.set_edgecolor('blue')

#Optional step that sets the background color
plt.gca().set_facecolor('grey')
plt.gca().set_axis_on()

plt.show()
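
The three numbers passed to `subsets` are, in order, the count of items only in the first set, only in the second set, and in both. The sketch below derives such a tuple from two Python sets; the sample IDs and set names are made up for illustration.

```python
# Hypothetical sample IDs, for illustration only
high_acidity = {1, 2, 3, 4}
high_alcohol = {3, 4, 5, 6, 7}

only_first = len(high_acidity - high_alcohol)   # in the first set only
only_second = len(high_alcohol - high_acidity)  # in the second set only
in_both = len(high_acidity & high_alcohol)      # shared by both sets

subsets = (only_first, only_second, in_both)
print(subsets)  # (2, 3, 2)
```

A tuple built this way can be passed directly as the `subsets` argument of `venn2`.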

Try question 4 and type "finished" in the chat when you are done.

**Q4** Create a Venn diagram that visualizes these variables we created earlier in the pie chart example:
- Total_LessThan5
- Total_GreaterThan5
- Total_EqualTo5

In [None]:
venn2(subsets = (ChangeMe,ChangeMe,ChangeMe), set_labels = ('ChangeMe', 'ChangeMe'))

## Another Library: Seaborn

Seaborn is built on top of Matplotlib. It can draw many of the same charts, along with charts that visualize correlations between variables.

Install Seaborn

In [None]:
%pip install seaborn



Import the Seaborn library, and ensure that NumPy, Matplotlib, and pandas are also imported

In [None]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns 

Create a visual graph showing the correlation between each wine property 

In [None]:
#Import the red wine data, and use df as the variable name.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';') 

#Set the size of the chart
f, ax = plt.subplots(figsize=(10, 8)) 

#Compute the pairwise correlation of the columns; the result is a DataFrame, not a list
corr = df.corr()

#Print the correlation table before drawing the actual chart
print(corr)

#Create the heatmap chart
sns.heatmap(corr, annot=True, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(899, 87, as_cmap=True),
            square=True, ax=ax)

#Create a title for the chart
ax.set_title("Red Wine Samples")

plt.show()
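
By default, `df.corr()` reports the Pearson correlation coefficient for each pair of columns, a value between -1 and 1. As a rough illustration of what a single cell of the heatmap means, this sketch computes that coefficient by hand for short made-up columns; the helper `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up values, for illustration only
acidity = [1.0, 2.0, 3.0, 4.0]
rising = [2.0, 4.0, 6.0, 8.0]    # rises in step with acidity
falling = [8.0, 6.0, 4.0, 2.0]   # falls as acidity rises

print(round(pearson(acidity, rising), 6))   # 1.0
print(round(pearson(acidity, falling), 6))  # -1.0
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.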

A similar way of writing the code, along with slanted labels

In [None]:
corr = df.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
#Set the x-axis labels to be written on a slant.
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=35,
    horizontalalignment='right'
);

**Q5** Create the same graph for the white wine samples. Display only the graph, with the coefficients written on each colored square, not the list of correlations. Keep the horizontal axis labels unrotated.

In [None]:
#Alter the web address to import the white wine samples, by changing red to white. 
df2 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

#Set the size of your graph
f, ax = plt.subplots(figsize=(10, 8))

#Alter the code so that it creates a heatmap for the white wine dataset, and change the annot to display the coefficients on the chart
corr = ChangeMe.corr()
ax = sns.heatmap(
    corr,annot=ChangeMe, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True)

#### Create a Seaborn joint plot to show the relationship between two variables: "fixed acidity" and "citric acid"

First, set the column names

In [None]:
df.columns = ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality"] 

In [None]:
#set up the joint plot
sns.jointplot(x="fixed acidity", y="citric acid", data=df, kind="reg", color="cyan");

#Create an arrow shape and a label to point at the chart
plt.annotate('Notice something?', xy=(9, 1.002), xytext=(4, 1.004), arrowprops={'facecolor':'grey', 'shrink':0.05})
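
With `kind="reg"`, the joint plot overlays a fitted straight line on the scatter points. The sketch below shows an ordinary least-squares line fit, which is the kind of fit such a line represents; the helper `fit_line` and the points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points lying exactly on y = 0.5x + 1
xs = [2.0, 4.0, 6.0, 8.0]
ys = [2.0, 3.0, 4.0, 5.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 0.5 1.0
```

A positive slope means the second variable tends to rise as the first one rises.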

**Q6** Create a Seaborn joint plot to show the relationship between fixed acidity and pH

In [None]:
#set up the joint plot


### Violin Chart

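A violin plot shows how the values of one variable are distributed within each category of another, with one violin per category drawn side by side.

Create a violin plot to compare the distribution of `pH` across `quality` ratings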

In [None]:
#Create a violin chart
sns.violinplot(x="quality", y="pH", data=df,
               palette=["teal", "lightblue", "beige", "lightgrey", "lightgreen"]);

#Create a title
plt.title('Comparing the Distribution of Quality and pH')
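
Each violin summarizes the `pH` values of one quality group. As a numeric counterpart, this sketch groups made-up `(quality, pH)` pairs and reports the median of each group; the values and variable names are illustrative only.

```python
from collections import defaultdict
from statistics import median

# Made-up (quality, pH) pairs, for illustration only
rows = [(5, 3.25), (5, 3.5), (5, 3.75), (6, 3.0), (6, 3.5), (7, 3.25)]

ph_by_quality = defaultdict(list)
for quality, ph in rows:
    ph_by_quality[quality].append(ph)

# One summary per group, mirroring one violin per quality level
medians = {quality: median(values)
           for quality, values in sorted(ph_by_quality.items())}
print(medians)  # {5: 3.5, 6: 3.25, 7: 3.25}
```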


**Q7** Create a violin plot to compare the distribution between quality and total sulfur dioxide

In [None]:
#Create a violin chart


# Practice Writing SQL queries online

Go to: [bit.io](https://bit.io/laritzrp/WineSamples)

Once you're done, type finished in the chatbox.

Example 1: Calculate the total number of redwine samples.

Solution - In the query editor, type the following:
```
SELECT count(redwine.quality) 
FROM "laritzrp/WineSamples"."redwine"; 
```
The query result shows that there are 1599 redwine samples.
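
The same kind of query can also be tried locally with Python's built-in `sqlite3` module. The sketch below is a stand-in for the online editor, not a copy of it: it builds a tiny in-memory `redwine` table with made-up rows, so the count it prints is not the dataset's.

```python
import sqlite3

# In-memory database with a small made-up table (not the real dataset)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE redwine (alcohol REAL, quality INTEGER)")
conn.executemany(
    "INSERT INTO redwine VALUES (?, ?)",
    [(9.4, 5), (9.8, 5), (10.5, 6), (13.0, 7)],
)

# count(column) counts the rows where that column is not NULL
(total,) = conn.execute("SELECT count(quality) FROM redwine").fetchone()
print(total)  # 4
conn.close()
```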


**Q8** Copy and paste this template into the query editor. Modify the code to find the total number of white wine samples.
```
SELECT count(redwine.quality) 
FROM "laritzrp/WineSamples"."redwine"; 
```

**Q9** Copy and paste this template into the query editor. Modify the code to find the min and max pH amounts for the red wine samples
```
SELECT min(redwine.alcohol) AS "Minimum Alcohol Amount", max(redwine.alcohol) AS "Maximum Alcohol Amount"
FROM "laritzrp/WineSamples"."redwine";
```

#### Using the GROUP BY and HAVING clauses, and sorting in ascending or descending order

Find the average red wine quality amounts for each different alcohol level found in the dataset. Only display the averages for alcohol amounts greater than or equal to 13. Order the values by alcohol amounts, in ascending order.
```
SELECT (redwine.alcohol) AS "Alcohol Amount", count(redwine.alcohol) AS "Number of Red Wine Samples", avg(redwine.quality) AS "AVG of Quality"
FROM "laritzrp/WineSamples"."redwine"
GROUP BY redwine.alcohol
HAVING ((redwine.alcohol)>=13)
ORDER BY (redwine.alcohol) ASC;
```
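
To see how the grouping, filtering, and sorting steps interact, here is a minimal sketch that runs a query of the same shape with Python's built-in `sqlite3` module. The table contents are made up and the column aliases `samples` and `avg_quality` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE redwine (alcohol REAL, quality INTEGER)")
# Made-up rows, for illustration only
conn.executemany(
    "INSERT INTO redwine VALUES (?, ?)",
    [(12.5, 6), (13.0, 6), (13.0, 8), (14.0, 7), (14.0, 5), (14.0, 6)],
)

query = """
SELECT alcohol, count(alcohol) AS samples, avg(quality) AS avg_quality
FROM redwine
GROUP BY alcohol          -- one output row per distinct alcohol value
HAVING alcohol >= 13      -- keep only the groups at or above 13
ORDER BY alcohol ASC      -- smallest alcohol value first
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(13.0, 2, 7.0), (14.0, 3, 6.0)]
conn.close()
```

The row with alcohol 12.5 forms its own group but is dropped by `HAVING`, so it does not appear in the result.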

**Q10** Copy and paste this template into the query editor. Modify the code to find the total number of "free sulfur dioxide" levels greater than or equal to 30. Show the results in descending order. 
```
SELECT (redwine.alcohol) AS "Alcohol Amount", count(redwine.alcohol) AS "Number of Red Wine Samples", avg(redwine.quality) AS "AVG of Quality"
FROM "laritzrp/WineSamples"."redwine" 
GROUP BY redwine.alcohol
HAVING ((redwine.alcohol)>=13)
ORDER BY (redwine.alcohol) ASC;
```

Example using the WHERE clause

List all of the citric acid amounts for the red wine samples rated a quality level of 8.

```
SELECT (redwine."citric acid") 
FROM "laritzrp/WineSamples"."redwine"
WHERE ((redwine.quality)=8); 
```
Selecting data from multiple tables:

From the same column

List the quality levels for both the red and white wine samples
```
SELECT (redwine.quality) AS "red wine Quality", (whitewine.quality) AS "White Wine Quality"
FROM "laritzrp/WineSamples"."redwine", "laritzrp/WineSamples"."whitewine";  
```
From different columns

List the citric acid levels of the red wine samples, and the pH levels of the white wine samples
```
SELECT (redwine."citric acid") AS "Red Wine Citric Acid", (whitewine.pH) AS "White Wine pH"
FROM "laritzrp/WineSamples"."redwine", "laritzrp/WineSamples"."whitewine"; 
```
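
Listing two tables in `FROM` without a join condition pairs every row of the first table with every row of the second, so the result has as many rows as the product of the two table sizes. The sketch below shows this with Python's built-in `sqlite3` module and tiny made-up tables, and contrasts it with `UNION ALL`, one possible way to list both tables' values one after the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE redwine (quality INTEGER)")
conn.execute("CREATE TABLE whitewine (quality INTEGER)")
# Made-up rows, for illustration only
conn.executemany("INSERT INTO redwine VALUES (?)", [(5,), (6,), (7,)])
conn.executemany("INSERT INTO whitewine VALUES (?)", [(6,), (8,)])

# Comma-separated tables: every red row is paired with every white row
paired = conn.execute(
    "SELECT redwine.quality, whitewine.quality FROM redwine, whitewine"
).fetchall()
print(len(paired))  # 6, i.e. 3 red rows x 2 white rows

# UNION ALL stacks the two result sets instead of pairing them
stacked = conn.execute(
    "SELECT 'red' AS wine, quality FROM redwine "
    "UNION ALL "
    "SELECT 'white' AS wine, quality FROM whitewine"
).fetchall()
print(len(stacked))  # 5
conn.close()
```

On tables with thousands of rows each, the paired form grows into millions of rows, which is worth keeping in mind when reading the query results.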

Please do questions 11 and 12, and type done in the chat when you are done.

**Q11** Create your own SQL statement that counts the number of white wine samples with a quality rating of 8.

**Q12** Create your own SQL statement that counts the number of red wine samples with a quality rating of 8.