# Pandas Basics Practice

Now it's your turn. Let's start practicing with Pandas!

You'll walk through instantiating a `DataFrame`, reading data into it, looking at and examining that data, and then playing with it. 


We use a dataset on the [quality of red wines](https://archive.ics.uci.edu/ml/datasets/wine+quality) for this purpose.
It is located in the `data` folder within this directory. It's called `winequality-red.csv`.


Typically, we use Jupyter notebooks like this for a very specific set of things: presentations and EDA (Exploratory Data Analysis).


Today, as we'll be playing around with `Pandas`, much of what we'll be doing is considered EDA. Therefore, by using a notebook, we'll get a tighter feedback loop with our work than we would trying to write a script. But, in general, **we do not use Jupyter notebooks for development**. 

Below, we've put a set of questions and then a cell for you to work on answers. However, feel free to add additional cells if you'd like. Often it will make sense to use more than one cell for your answers. 



# Assignment Questions 

### Part 1 - The Basics of DataFrames

Let's start off by following the general workflow that we use when moving data into a DataFrame: 

* Importing Pandas
* Reading data into the DataFrame
* Getting a general sense of the data

So, in terms of what you should do for this part...

1. Import pandas
2. Read the wine data into a DataFrame. 
3. Use the `attributes` and `methods` available on DataFrames to answer the following questions: 
    * How many rows and columns are in the DataFrame?
    * What data type is in each column?
    * Are all of the variables continuous, or are any categorical?
    * How many non-null values are in each column?
    * What are the min, mean, max, median for all numeric columns?

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../data/winequality-red.csv', delimiter=';')

In [None]:
df.head()

In [None]:
#How many rows and columns are in the DataFrame?
df.shape

In [None]:
# What data type is in each column?
# info() also lists the non-null count per column.
df.info()

In [None]:
# Are all of the variables continuous, or are any categorical?
# Count the unique values per column: `quality` has only a few distinct values,
# which suggests it is a categorical measure.

for col in df.columns:
    print(f'- {col}: {df[col].nunique()}')
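
The last two questions in the list (non-null counts and min, mean, max, median) can be answered with `count` and `agg`. The following is only a minimal sketch: `sample` is a small hypothetical frame with made-up values, standing in for the wine data so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the values are made up for illustration.
sample = pd.DataFrame({
    "pH": [3.51, 3.20, None, 3.16],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

# Number of non-null values in each column
non_null = sample.count()

# min, mean, max and median of every numeric column in one table
summary = sample.agg(["min", "mean", "max", "median"])

print(non_null)
print(summary)
```

On the notebook's `df`, `df.describe()` gives a similar table, with the median shown in the `50%` row.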

### Part 2 - Practice with Grabbing Data

Let's now get some practice with grabbing certain parts of the data. If you'd like some extra practice, try answering each of the questions in more than one way (because remember, we can often grab our data in a couple of different ways). 

1. Grab the first 10 rows of the `chlorides` column. 
2. Grab the last 10 rows of the `chlorides` column. 
3. Grab indices 264-282 of the `chlorides` **and** `density` columns. 
4. Grab all rows where the `chlorides` value is less than 0.10. 
5. Now grab all the rows where the `chlorides` value is greater than the column's mean (try **not** to use a hard-coded value for the mean, but instead a method).
6. Grab all those rows where the `pH` is greater than 3.0 and less than 3.5. 
7. Further filter the results from 6 to grab only those rows that have a `residual sugar` less than 2.0. 

In [None]:
#1. Grab the first 10 rows of the `chlorides` column. 
#2. Grab the last 10 rows of the `chlorides` column. 

# A cell only displays its last expression, so print both results explicitly.
print(df.chlorides.head(10))
print(df.chlorides.tail(10))

In [None]:
#3. Grab indices 264-282 of the `chlorides` and `density` columns.

df.loc[264:282, ['chlorides', 'density']]
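
As one way of answering the question differently, the same rows can be selected by position with `iloc`. The sketch below uses a small hypothetical frame (`sample`, made-up values, default integer index) to show the difference: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position.

```python
import pandas as pd

# Hypothetical frame with a default integer index 0..9; values are made up.
sample = pd.DataFrame({
    "chlorides": [i / 100 for i in range(10)],
    "density": [0.99 + i / 1000 for i in range(10)],
})

# Label-based: the end label 5 is included -> labels 2, 3, 4, 5
by_label = sample.loc[2:5, ["chlorides", "density"]]

# Position-based: the end position 6 is excluded -> positions 2, 3, 4, 5
by_position = sample.iloc[2:6][["chlorides", "density"]]

print(by_label.equals(by_position))
```

With the default integer index, labels and positions coincide, so on the wine data `df.iloc[264:283]` would be expected to match `df.loc[264:282]`.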

In [None]:
#4. Grab all rows where the `chlorides` value is less than 0.10.

df[df.chlorides < 0.1]

In [None]:
#5. Now grab all the rows where the `chlorides` value is greater than the column's mean
#(try not to use a hard-coded value for the mean, but instead a method.)

df.query('chlorides > chlorides.mean()')

In [None]:
#6. Grab all those rows where the `pH` is greater than 3.0 and less than 3.5. 

df.query('pH > 3.0 and pH < 3.5')

In [None]:
#7. Further filter the results from 6 to grab only those rows that have a `residual sugar` less than 2.0.
# Tip: Wrap column names that contain spaces in backticks (``) inside the query string.

df.query('pH > 3.0 and pH < 3.5 and `residual sugar` < 2.0')
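
The `query` filters above can also be written as boolean masks, which is one of the "other ways" to grab the same data. A minimal sketch on a hypothetical `sample` frame with made-up values; it assumes a pandas version that accepts `inclusive="neither"` in `Series.between` (1.3 or later).

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the values are made up for illustration.
sample = pd.DataFrame({
    "pH": [2.9, 3.1, 3.3, 3.5, 3.4],
    "residual sugar": [1.5, 1.9, 2.6, 1.2, 1.8],
})

# 1) query string, as in the cells above
via_query = sample.query("pH > 3.0 and pH < 3.5 and `residual sugar` < 2.0")

# 2) boolean masks combined with & (each comparison needs its own parentheses)
mask = (sample["pH"] > 3.0) & (sample["pH"] < 3.5) & (sample["residual sugar"] < 2.0)
via_mask = sample[mask]

# 3) between() for the range check; "neither" excludes both bounds
in_range = sample["pH"].between(3.0, 3.5, inclusive="neither")
via_between = sample[in_range & (sample["residual sugar"] < 2.0)]

print(via_mask)
```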

### Part 3 - More Practice

Let's move on to some more complicated things. Use your knowledge of `groupby` and sorting to answer the following.

1. Get the average amount of `chlorides` for each `quality` value. 
2. For observations with a `pH` greater than 3.0 and less than 4.0, find the average `alcohol` value by `pH`. 
3. For observations with an `alcohol` value between 9.25 and 9.5, find the highest amount of `residual sugar`. 
4. Create a new column, called `total_acidity`, that is the sum of `fixed acidity` and `volatile acidity`. 
5. Find the average `total_acidity` for each of the `quality` values. 
6. Find the top 5 `density` values. 
7. Find the 10 lowest `sulphates` values. 

In [None]:
#1. Get the average amount of `chlorides` for each `quality` value.

df.groupby('quality')['chlorides'].mean().reset_index()

In [None]:
#2. For observations with a `pH` greater than 3.0 and less than 4.0, find the average `alcohol` value by `pH`. 

df.query('pH > 3.0 and pH < 4.0').groupby('pH')['alcohol'].mean().reset_index()

In [None]:
#3. For observations with an `alcohol` value between 9.25 and 9.5, find the highest amount of `residual sugar`. 

df.query('alcohol > 9.25 and alcohol < 9.5')['residual sugar'].max()

In [None]:
#4. Create a new column, called `total_acidity`, that is the sum of `fixed acidity` and `volatile acidity`. 

df.eval('total_acidity = `fixed acidity` + `volatile acidity`', inplace=True)
df
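
Besides `eval`, a new column can be created with plain column assignment or with `assign`. The sketch below uses a hypothetical two-row frame (`base`, made-up values) to show both; `assign` returns a new frame, while column assignment changes the frame it is applied to.

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the values are made up for illustration.
base = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "volatile acidity": [0.70, 0.88],
})

# assign() returns a new DataFrame and leaves `base` untouched
with_total = base.assign(
    total_acidity=base["fixed acidity"] + base["volatile acidity"]
)

# Column assignment modifies the frame itself (done on a copy here)
in_place = base.copy()
in_place["total_acidity"] = in_place["fixed acidity"] + in_place["volatile acidity"]

print(with_total)
```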

In [None]:
# 5. Find the average `total_acidity` for each of the `quality` values. 

df.groupby('quality')['total_acidity'].mean().reset_index()

In [None]:
# Question 5 and similar ones can also be solved with a pivot table.
# We'll come back to this later; here it is just as an example.

# Passing the aggregation as the string 'mean' avoids the FutureWarning
# that newer pandas versions emit for aggfunc=np.mean.
table = pd.pivot_table(df, index=['quality'],
        values=['total_acidity'],
        aggfunc='mean').reset_index()
table

In [None]:
# 6. Find the top 5 `density` values.

df.sort_values('density', ascending=False).head(5)['density'].reset_index()

In [None]:
# 7. Find the 10 lowest `sulphates` values. 

df.sort_values('sulphates').head(10)['sulphates'].reset_index()
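
The two sorting questions can also be answered without sorting the whole frame, using `nlargest` and `nsmallest` on the column. A minimal sketch on a hypothetical `sample` frame with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the values are made up for illustration.
sample = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9964, 0.9951],
    "sulphates": [0.56, 0.68, 0.65, 0.58, 0.46, 0.47],
})

# Largest 3 density values, highest first; the original row index is kept
top_density = sample["density"].nlargest(3)

# Smallest 2 sulphates values, lowest first
lowest_sulphates = sample["sulphates"].nsmallest(2)

print(top_density)
print(lowest_sulphates)
```

Both calls return rows, so a value that occurs several times is counted several times. If distinct values are wanted, calling `drop_duplicates()` on the column first is one option.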