# Data Manipulation Studio

For this studio, we will revisit the data set from our last studio. If you recall, California farmers were looking for advice on growing pumpkins. We will use the same [pumpkins dataset](https://www.kaggle.com/usda/a-year-of-pumpkin-prices) as provided by the U.S. Department of Agriculture. You may have to clean data in the process of data manipulation, so feel free to pull up your notebook from the last class's studio.

We will now be focusing our attention on a different region in the United States, the Northeast. When you open up the `dataset` folder, you will have 13 CSVs, including the San Francisco and Los Angeles data from the last lesson. The 13 CSVs are each a different terminal market in the United States.

A **terminal market** is a central site, often in a metropolitan area, that serves as an assembly and trading place for commodities. Terminal markets for agricultural commodities are usually at or near major transportation hubs. [Definition Source](https://en.wikipedia.org/wiki/Terminal_market#:~:text=A%20terminal%20market%20is%20a,or%20near%20major%20transportation%20hubs)

## Getting Started

Import the CSVs for each of the following cities: Baltimore, Boston, New York, and Philadelphia. Set up a dataframe for each city.
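The file names in the `dataset` folder follow the pattern `<city>_9-24-2016_9-30-2017.csv`, so the four dataframes can also be loaded in a loop instead of with one hard-coded path per city. This is a minimal sketch, assuming the notebook can reach the `dataset` folder by a relative path; `load_city_frames` and `DATE_RANGE` are hypothetical names introduced here and are not part of the studio starter code.

```python
from pathlib import Path

import pandas as pd

# Suffix shared by the studio's CSV file names, e.g. "boston_9-24-2016_9-30-2017.csv".
DATE_RANGE = "9-24-2016_9-30-2017"


def load_city_frames(dataset_dir, cities):
    """Return a dict mapping each city slug to the dataframe read from its CSV."""
    dataset_dir = Path(dataset_dir)
    return {
        city: pd.read_csv(dataset_dir / f"{city}_{DATE_RANGE}.csv")
        for city in cities
    }


# Assumed relative location of the data; adjust to wherever your copy lives.
# frames = load_city_frames("dataset", ["baltimore", "boston", "new-york", "philadelphia"])
# df_Baltimore = frames["baltimore"]
```

Keeping the frames in a dict also makes it easy to repeat the same cleaning step for every city later on.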

In [8]:
# Import the necessary libraries and CSVs. Make some dataframes!
import pandas as pd
import matplotlib 
import matplotlib.pyplot as plt
import numpy as np

df_Baltimore = pd.read_csv("/Users/miafusco/Documents/LaunchCode/data-analysis-projects2/data-analysis-projects-class-19-and-20/class-18/studio/dataset/baltimore_9-24-2016_9-30-2017.csv")
df_Boston = pd.read_csv("/Users/miafusco/Documents/LaunchCode/data-analysis-projects2/data-analysis-projects-class-19-and-20/class-18/studio/dataset/boston_9-24-2016_9-30-2017.csv")
df_New_York = pd.read_csv("/Users/miafusco/Documents/LaunchCode/data-analysis-projects2/data-analysis-projects-class-19-and-20/class-18/studio/dataset/new-york_9-24-2016_9-30-2017.csv")
df_Philadelphia = pd.read_csv("/Users/miafusco/Documents/LaunchCode/data-analysis-projects2/data-analysis-projects-class-19-and-20/class-18/studio/dataset/philadelphia_9-24-2016_9-30-2017.csv")

print("Baltimore Data")
print(df_Baltimore.head(10))

print("Boston Data")
print(df_Boston.head(10))

print("New York Data")
print(df_New_York.head(10))

print("Philadelphia Data")
print(df_Philadelphia.head(10))

Baltimore Data
  Commodity Name  City Name  Type       Package      Variety Sub Variety  \
0       PUMPKINS  BALTIMORE   NaN  24 inch bins          NaN         NaN   
1       PUMPKINS  BALTIMORE   NaN  24 inch bins          NaN         NaN   
2       PUMPKINS  BALTIMORE   NaN  24 inch bins  HOWDEN TYPE         NaN   
3       PUMPKINS  BALTIMORE   NaN  24 inch bins  HOWDEN TYPE         NaN   
4       PUMPKINS  BALTIMORE   NaN  24 inch bins  HOWDEN TYPE         NaN   
5       PUMPKINS  BALTIMORE   NaN  24 inch bins  HOWDEN TYPE         NaN   
6       PUMPKINS  BALTIMORE   NaN  36 inch bins  HOWDEN TYPE         NaN   
7       PUMPKINS  BALTIMORE   NaN  36 inch bins  HOWDEN TYPE         NaN   
8       PUMPKINS  BALTIMORE   NaN  36 inch bins  HOWDEN TYPE         NaN   
9       PUMPKINS  BALTIMORE   NaN  36 inch bins  HOWDEN TYPE         NaN   

   Grade        Date  Low Price  High Price  ...  Color  Environment  \
0    NaN  04/29/2017        270       280.0  ...    NaN          NaN   
1   

In [16]:
print("Baltimore Columns")
print(df_Baltimore.columns)

print("Boston Columns")
print(df_Boston.columns)

print("New York Columns")
print(df_New_York.columns)

print("Philadelphia Columns")
print(df_Philadelphia.columns)

Baltimore Columns
Index(['Commodity Name', 'City Name', 'Type', 'Package', 'Variety',
       'Sub Variety', 'Grade', 'Date', 'Low Price', 'High Price', 'Mostly Low',
       'Mostly High', 'Origin', 'Origin District', 'Item Size', 'Color',
       'Environment', 'Unit of Sale', 'Quality', 'Condition', 'Appearance',
       'Storage', 'Crop', 'Repack', 'Trans Mode'],
      dtype='object')
Boston Columns
Index(['Commodity Name', 'City Name', 'Type', 'Package', 'Variety',
       'Sub Variety', 'Grade', 'Date', 'Low Price', 'High Price', 'Mostly Low',
       'Mostly High', 'Origin', 'Origin District', 'Item Size', 'Color',
       'Environment', 'Unit of Sale', 'Quality', 'Condition', 'Appearance',
       'Storage', 'Crop', 'Repack', 'Trans Mode'],
      dtype='object')
New York Columns
Index(['Commodity Name', 'City Name', 'Type', 'Package', 'Variety',
       'Sub Variety', 'Grade', 'Date', 'Low Price', 'High Price', 'Mostly Low',
       'Mostly High', 'Origin', 'Origin District', 'Item Size'

## Clean Your Data

In the last lesson, we cleaned the data related to San Francisco. Pull up your notebook from the last lesson and use it as a reference to clean up these new dataframes.
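One common cleaning step is to measure how much of each column is missing and then drop the columns that are mostly empty. The sketch below shows that idea on a made-up toy dataframe. The helper names `missing_report` and `drop_sparse_columns` and the 80% cutoff are assumptions chosen for illustration, not requirements of the studio.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the percentage of missing values per column, rounded to whole numbers."""
    return (df.isnull().mean() * 100).round().astype(int)


def drop_sparse_columns(df, max_missing_pct=80):
    """Keep only the columns whose share of missing values is at most max_missing_pct."""
    pct = df.isnull().mean() * 100
    return df.loc[:, pct <= max_missing_pct]


# Toy frame standing in for one city's data (values are made up).
toy = pd.DataFrame({
    "City Name": ["BOSTON"] * 5,
    "Type": [np.nan] * 5,                                      # 100% missing
    "Color": ["ORANGE", np.nan, "WHITE", "ORANGE", "ORANGE"],  # 20% missing
    "Low Price": [15, 18, 20, 15, 17],
})

print(missing_report(toy))
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Whether a sparse column such as `Color` should be dropped or filled depends on the question you want to answer, so treat the cutoff as a starting point.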

In [9]:
# Clean your data here!
# Report the percentage of missing values per column, one city at a time.
city_frames = {
    "Baltimore": df_Baltimore,
    "Boston": df_Boston,
    "New York": df_New_York,
    "Philadelphia": df_Philadelphia,
}

for city, df in city_frames.items():
    print("Missing Data {}".format(city))
    for col in df.columns:
        pct_missing = np.mean(df[col].isnull())
        print('{} - {}%'.format(col, round(pct_missing*100)))

Missing Data Baltimore
Commodity Name - 0%
City Name - 0%
Type - 100%
Package - 0%
Variety - 1%
Sub Variety - 84%
Grade - 100%
Date - 0%
Low Price - 0%
High Price - 0%
Mostly Low - 0%
Mostly High - 0%
Origin - 3%
Origin District - 100%
Item Size - 16%
Color - 80%
Environment - 100%
Unit of Sale - 84%
Quality - 100%
Condition - 100%
Appearance - 100%
Storage - 100%
Crop - 100%
Repack - 0%
Trans Mode - 100%
Missing Data Boston
Commodity Name - 0%
City Name - 0%
Type - 100%
Package - 0%
Variety - 0%
Sub Variety - 92%
Grade - 100%
Date - 0%
Low Price - 0%
High Price - 0%
Mostly Low - 0%
Mostly High - 0%
Origin - 0%
Origin District - 81%
Item Size - 1%
Color - 14%
Environment - 100%
Unit of Sale - 87%
Quality - 100%
Condition - 100%
Appearance - 100%
Storage - 100%
Crop - 100%
Repack - 0%
Trans Mode - 100%
Missing Data New York
Commodity Name - 0%
City Name - 0%
Type - 100%
Package - 0%
Variety - 0%
Sub Variety - 84%
Grade - 100%
Date - 0%
Low Price - 0%
High Price - 0%
Mostly Low - 0%
Most

In [14]:
# A notebook cell displays only its last expression, so the output below is Philadelphia's.
df_Baltimore['Color'].unique()
df_Boston['Color'].unique()
df_New_York['Color'].unique()
df_Philadelphia['Color'].unique()

array([nan])

In [19]:
print("Baltimore Value Counts")
print(df_Baltimore['Color'].value_counts())

print("Boston Value Counts")
print(df_Boston['Color'].value_counts())

print("New York Value Counts")
print(df_New_York['Color'].value_counts())

print("Philadelphia Value Counts")
print(df_Philadelphia['Color'].value_counts())

Baltimore Value Counts
Color
ORANGE    17
WHITE     14
Name: count, dtype: int64
Boston Value Counts
Color
ORANGE    276
WHITE      28
Name: count, dtype: int64
New York Value Counts
Color
ORANGE    11
WHITE     10
Name: count, dtype: int64
Philadelphia Value Counts
Series([], Name: count, dtype: int64)


## Combine Your Data

Now that you have four clean sets of data, combine all four into one dataframe that represents the entire Northeast region.
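As a small illustration of row-wise concatenation, the sketch below stacks two made-up toy dataframes with `pd.concat`. The frames `df_a` and `df_b` are hypothetical stand-ins for the cleaned city dataframes.

```python
import pandas as pd

# Two toy frames standing in for cleaned city dataframes (values are made up).
df_a = pd.DataFrame({"City Name": ["BALTIMORE", "BALTIMORE"], "Low Price": [270, 160]})
df_b = pd.DataFrame({"City Name": ["BOSTON"], "Low Price": [180]})

# axis=0 stacks rows; ignore_index=True renumbers them 0..n-1 instead of
# repeating each frame's original index.
combined = pd.concat([df_a, df_b], axis=0, ignore_index=True)

print(combined.shape)
print(combined["City Name"].unique())
```

Without `ignore_index=True` the combined frame keeps each city's original row labels, so the same label can appear once per city.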

In [37]:
# Combine the four dataframes into one!
df_Northeast = pd.concat([df_Baltimore, df_Boston, df_New_York, df_Philadelphia], axis = 0)
print(df_Northeast)

# Strip stray whitespace from the column names. Assign back to .columns;
# assigning to df_Northeast would replace the dataframe with an Index.
df_Northeast.columns = df_Northeast.columns.str.strip()
# Drop only the columns that are empty in every row. A bare dropna() removes
# every row with any missing value, and its result was never assigned.
df_Northeast = df_Northeast.dropna(axis=1, how='all')

print(df_Northeast['City Name'].unique())

   Commodity Name     City Name  Type             Package      Variety  \
0        PUMPKINS     BALTIMORE   NaN        24 inch bins          NaN   
1        PUMPKINS     BALTIMORE   NaN        24 inch bins          NaN   
2        PUMPKINS     BALTIMORE   NaN        24 inch bins  HOWDEN TYPE   
3        PUMPKINS     BALTIMORE   NaN        24 inch bins  HOWDEN TYPE   
4        PUMPKINS     BALTIMORE   NaN        24 inch bins  HOWDEN TYPE   
..            ...           ...   ...                 ...          ...   
52       PUMPKINS  PHILADELPHIA   NaN  1/2 bushel cartons    MINIATURE   
53       PUMPKINS  PHILADELPHIA   NaN  1/2 bushel cartons    MINIATURE   
54       PUMPKINS  PHILADELPHIA   NaN  1/2 bushel cartons    MINIATURE   
55       PUMPKINS  PHILADELPHIA   NaN  1/2 bushel cartons    MINIATURE   
56       PUMPKINS  PHILADELPHIA   NaN  1/2 bushel cartons    MINIATURE   

   Sub Variety  Grade        Date  Low Price  High Price  ...  Color  \
0          NaN    NaN  04/29/2017      

## Answer Some Questions

Use `groupby()` and `agg()` to answer the following two questions:

1. What are the mean low and high prices for each type of **unit of sale** in the Northeast region?
2. For each region, what is the average number of pumpkins per variety that came into terminal markets for the year? 
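The two questions above can be sketched on a made-up toy dataframe. The column names match the pumpkin CSVs, but every value is invented for illustration. The reading of question 2 used here (count the report rows per city and variety, then average those counts across cities) is an assumption, since the data lists price reports and not individual pumpkins.

```python
import pandas as pd

# Toy data standing in for the combined Northeast dataframe (values are made up).
toy = pd.DataFrame({
    "City Name": ["BOSTON", "BOSTON", "BOSTON", "BALTIMORE", "BALTIMORE"],
    "Variety": ["HOWDEN TYPE", "HOWDEN TYPE", "MINIATURE", "HOWDEN TYPE", "MINIATURE"],
    "Unit of Sale": ["PER BIN", "EACH", "EACH", "PER BIN", None],
    "Low Price": [100, 10, 12, 120, 15],
    "High Price": [120, 14, 16, 140, 18],
})

# Question 1: mean low and high price per unit of sale.
# groupby() skips rows whose key is missing, so the None row is left out.
price_by_unit = toy.groupby("Unit of Sale").agg(
    mean_low=("Low Price", "mean"),
    mean_high=("High Price", "mean"),
)
print(price_by_unit)

# Question 2 (one possible reading): count report rows per city and variety,
# then average those counts across cities for each variety.
rows_per_city = toy.groupby(["City Name", "Variety"]).size()
avg_rows_per_variety = rows_per_city.groupby(level="Variety").agg("mean")
print(avg_rows_per_variety)
```

Because most rows in the real data have no unit of sale, the first result only describes the minority of rows where that column is filled in.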

In [36]:
# Put your code here to find the mean low and high prices in the Northeast region for each type of unit of sale.
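# Group by 'Unit of Sale' (not 'City Name') to answer the question as asked.
# df_Northeast must be the combined dataframe for this to run.
unit_of_sale = df_Northeast.groupby('Unit of Sale')[['Low Price', 'High Price']].agg('mean')
print(unit_of_sale)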

In [58]:
# Put your code here to find the average number of pumpkins coming into terminal markets of each variety.


## Bonus Mission

Try answering the same questions for the Midwest (Chicago, Detroit, and St. Louis) or the Southeast (Atlanta, Columbia, and Miami) regions.

In [59]:
# Try the bonus mission if you have time!