
# Assignment for Pandas Basics Study Unit

These coding questions are designed to test whether you are ready to move on to the remainder of this course. They test your knowledge of Pandas fundamentals (what you learned in this study unit).

Please make sure:
- you read and follow the instructions and comments closely;
- you do not change the provided code in any way;
- you DO NOT copy any code from any source;
- you provide enough comments/pseudo code for your code.

## Import dependencies


In [None]:
import pandas as pd

## Loading the data

The dataset contains order information from a Chipotle restaurant.

In [None]:
data_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

order_df = pd.read_csv(data_url, sep = '\t')

In [None]:
# investigate what the data looks like
order_df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


## Task 0: check generic descriptive information of the data (10 points)

The first step with any dataset is to investigate how much data you have (__size__) and what type of data you have (__data types__). It is also important to learn the __column names__ and to find the __missing data__ in the dataset.

Let's do these steps one by one.
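As a quick orientation before the individual steps, here is a minimal sketch of the four checks on a small made-up DataFrame. The name `toy_df` and its values are hypothetical and are not taken from the Chipotle data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the four checks.
toy_df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "item_name": ["Izze", "Chips", "Chicken Bowl"],
    "choice_description": ["[Clementine]", None, None],
})

n_rows, n_cols = toy_df.shape            # size as (rows, columns)
n_cells = toy_df.size                    # rows * columns
column_names = list(toy_df.columns)      # column labels in order
missing_per_column = toy_df.isna().sum() # count of missing values per column

print(n_rows, n_cols, n_cells)
print(toy_df.dtypes)                     # one data type per column
print(column_names)
print(missing_per_column)
```

`.shape` is usually the more informative of the two size attributes, because `.size` alone does not say how the cells split into rows and columns.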



### Step 0.1: How much data do you have here?

In [None]:
#Checking shape then size

#order_df.shape

order_df.size #The df has 23110 items or 4622 rows by 5 columns



23110

### Step 0.2: What are the data types?

In [None]:
# write the code to discover the data types
order_df.dtypes # thoughts: may need to change item_price into a float; the $ sign has to be handled first

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

### Step 0.3: What are the column names?

In [None]:
# write the code to discover the column names
order_df.columns #column names match what we see in .head() snippet

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

### Step 0.4: Missing data detection

You can use the ``.isna()`` method for this. Refer to the _missing data_ chapter in the lecture notebooks for more detail.

In [None]:
# write your code to get the count/ratio of missing value in EACH column

# While the count is useful in some cases, the percentage also tells us how much of the data will be usable

# take the number of NA values per column, multiply by 100, and divide by the number of rows
order_df.isna().sum() * 100 / len(order_df) # Only the choice_description column has NA values; about 27% of its values are missing

order_id               0.000000
quantity               0.000000
item_name              0.000000
choice_description    26.958027
item_price             0.000000
dtype: float64
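The count and the ratio can also be reported side by side. The sketch below is one possible way to do that on hypothetical toy data; it relies on the fact that the mean of a boolean mask is the fraction of `True` values, so `.isna().mean()` gives the missing ratio directly.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "item_name": ["Izze", "Chips", "Chicken Bowl", "Chips"],
    "choice_description": ["[Clementine]", None, "[Rice]", None],
})

missing_count = toy_df.isna().sum()        # number of missing values per column
missing_pct = toy_df.isna().mean() * 100   # mean of True/False = fraction missing

# Combine both views into one small summary table.
summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct,
})
print(summary)
```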

## Task 1: Creating new features based on existing features (15 points)

While analyzing the data, a common task is that we create new features based on existing features. There are two different use cases of this:

1. Create a new feature based on the __calculation__ of two features;
2. Create a new feature based on the __condition__ of one or more features.

Let's try both of them.

### Step 1.0: Creating a new feature by transforming an existing feature

We notice that the ``item_price`` feature has the ``object`` data type, which means its values are stored as strings. This is not very convenient if we need to perform arithmetic calculations on them. Let us create a new feature ``price_value`` by stripping the leading ``$`` character from each value. Then we can convert the data type of ``price_value`` to ``float32``.

__HINT__: for any ``string`` column in Pandas, you can access its values using the ``.str`` attribute.

In [None]:
#We can decide what part of the string to remove by specifying with .str.strip()
#Now we can convert the remaining numeric part of the string into a float with
#.astype()

order_df['price_value'] = order_df['item_price'].str.strip('$').astype('float32')
order_df['price_value'].head()

0     2.39
1     3.39
2     3.39
3     2.39
4    16.98
Name: price_value, dtype: float32
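For comparison, here is a hedged sketch of a slightly more defensive cleaning step. The sample strings are hypothetical, including the thousands separator, which is not known to occur in this dataset. It converts to `float64` rather than the `float32` the assignment asks for, because `float32` cannot represent many two-decimal prices exactly, which is why a sum such as 23.78 can print as 23.780001.

```python
import pandas as pd

# Hypothetical price strings; the last one includes a thousands separator.
prices = pd.Series(["$2.39", "$16.98", "$1,024.50"])

# Remove the currency symbol and any commas as literal text (regex=False),
# then convert the remaining digits to floating-point numbers.
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
)
price_value = cleaned.astype("float64")

print(price_value)
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.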

With the new column created, we can prepare a new DF for the subsequent steps.

The new DF ``price_df`` contains the quantity of items and total price for each order.

In [None]:
# preparing new data
price_df = order_df.groupby('order_id').agg({'quantity': 'sum', 'price_value': 'sum'})


price_df.head()

Unnamed: 0_level_0,quantity,price_value
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4,11.56
2,2,16.98
3,2,12.67
4,2,21.0
5,2,13.7
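The same per-order totals can be written with named aggregation, which lets the output columns carry explicit names. This is a sketch on hypothetical toy data; `toy_orders`, `total_quantity`, and `total_price` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Hypothetical order lines: several rows can share one order_id.
toy_orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "quantity": [1, 2, 1, 1, 1],
    "price_value": [2.5, 7.0, 8.0, 1.5, 3.0],
})

# Named aggregation: new_column=(source_column, aggregation_name).
toy_totals = toy_orders.groupby("order_id").agg(
    total_quantity=("quantity", "sum"),
    total_price=("price_value", "sum"),
)
print(toy_totals)
```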



### Step 1.1: Calculating a new feature based on two existing features

With ``price_df``, we can create a new column ``average_price`` that is the ``price_value`` divided by ``quantity`` in each order.

In [None]:
#Get average price by calling the df[key] of price value and dividing it by the df[key] for quantity
#Rounded to 2 decimal places to match price_value format

price_df['average_price'] = round(price_df['price_value']/ price_df['quantity'], 2)

price_df.head() 

Unnamed: 0_level_0,quantity,price_value,average_price
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,4,11.56,2.89
2,2,16.98,8.49
3,2,12.67,6.34
4,2,21.0,10.5
5,2,13.7,6.85


Find out the information (entire row) of the order in ``price_df`` with the highest ``average_price``.

In [None]:
# Filter for the rows in the data set where the average price is highest
# There is a tie between order_id 123 and 253

price_df[price_df['average_price'] == price_df['average_price'].max()]

# price_df.loc[price_df['average_price'].idxmax()] #This is shorter but will only show the first row if there is a tie




Unnamed: 0_level_0,quantity,price_value,average_price
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
123,2,23.780001,11.89
253,2,23.780001,11.89


In [None]:
#confirming max value

price_df['average_price'].max()

11.89
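The difference between the two approaches to ties can be seen on a small example. The data below is hypothetical; it contrasts `idxmax()`, which returns only the first matching label, with a boolean mask and with `nlargest(..., keep="all")`, both of which keep every tied row.

```python
import pandas as pd

# Hypothetical averages with a tie for the maximum (orders 11 and 12).
toy_prices = pd.DataFrame(
    {"average_price": [5.0, 9.5, 9.5, 3.0]},
    index=pd.Index([10, 11, 12, 13], name="order_id"),
)

# idxmax() returns only the first label at which the maximum occurs.
first_label = toy_prices["average_price"].idxmax()

# A boolean mask keeps every row that equals the maximum.
max_price = toy_prices["average_price"].max()
all_ties = toy_prices[toy_prices["average_price"] == max_price]

# nlargest with keep="all" also returns all rows tied for the top value.
top_with_ties = toy_prices.nlargest(1, "average_price", keep="all")

print(first_label)
print(all_ties)
print(top_with_ties)
```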

### Step 1.2: Creating a new feature based on the condition of other features

We are going to implement this in the next assignment.

## Task 2: Filter Values based on Frequencies (25 points)

In analytics projects, we are particularly careful with categorical features. Common actions include, but are not limited to:

- Understanding unique values (categories) in the feature(s);
- Investigating the counts of the unique values in the features;
- Merging categories (unique values) by keeping only the _top n_ categories in the column and converting the remaining values into one category (usually ``other`` as a catch-all category).

Let's practice these steps one by one.



### Step 2.1: Understanding the unique values in a categorical column

Find out the unique values in the ``item_name`` column. 

__HINT__: Pandas provides a method ``.unique()`` for this purpose.

In [None]:
# I specify the column in the df by calling the column name as the key. This 
#will display the unique values in the column specified
order_df['item_name'].unique()

array(['Chips and Fresh Tomato Salsa', 'Izze', 'Nantucket Nectar',
       'Chips and Tomatillo-Green Chili Salsa', 'Chicken Bowl',
       'Side of Chips', 'Steak Burrito', 'Steak Soft Tacos',
       'Chips and Guacamole', 'Chicken Crispy Tacos',
       'Chicken Soft Tacos', 'Chicken Burrito', 'Canned Soda',
       'Barbacoa Burrito', 'Carnitas Burrito', 'Carnitas Bowl',
       'Bottled Water', 'Chips and Tomatillo Green Chili Salsa',
       'Barbacoa Bowl', 'Chips', 'Chicken Salad Bowl', 'Steak Bowl',
       'Barbacoa Soft Tacos', 'Veggie Burrito', 'Veggie Bowl',
       'Steak Crispy Tacos', 'Chips and Tomatillo Red Chili Salsa',
       'Barbacoa Crispy Tacos', 'Veggie Salad Bowl',
       'Chips and Roasted Chili-Corn Salsa',
       'Chips and Roasted Chili Corn Salsa', 'Carnitas Soft Tacos',
       'Chicken Salad', 'Canned Soft Drink', 'Steak Salad Bowl',
       '6 Pack Soft Drink', 'Chips and Tomatillo-Red Chili Salsa', 'Bowl',
       'Burrito', 'Crispy Tacos', 'Carnitas Crispy Tacos

The results above show the unique values in the ``item_name`` column. You can observe that some of the categories are duplicates of one another because of spelling variations (a space versus a ``-``).

### Step 2.2: Find how many unique values are in a categorical column

Find out how many unique values are in the ``item_name`` column. There are a few different ways to do this.

In [None]:
# I tried to strip the space and '-' characters to unify values that should be
# identical

# str.strip() is not effective here because it only removes leading and trailing
# characters, not the ones inside a string, but I wanted to show my logic
# nunique() returns the number of unique values in the column

# order_df['item_name'].str.strip('-').nunique() # does not merge the variants

order_df['item_name'].nunique()

50
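To see why `str.strip('-')` leaves the count unchanged while `str.replace` does not, consider the sketch below. It uses four of the item names shown in Step 2.1 plus `Izze`. Replacing the hyphen with a space is an assumption about how the variants should be unified, not something the assignment asks for.

```python
import pandas as pd

# A few item names from the unique-values output, including hyphen variants.
names = pd.Series([
    "Chips and Tomatillo-Green Chili Salsa",
    "Chips and Tomatillo Green Chili Salsa",
    "Chips and Roasted Chili-Corn Salsa",
    "Chips and Roasted Chili Corn Salsa",
    "Izze",
])

# strip() only trims characters at the two ends, so inner hyphens survive.
stripped = names.str.strip("-")

# replace() substitutes every occurrence, so the variants become identical.
normalized = names.str.replace("-", " ", regex=False)

print(stripped.nunique())
print(normalized.nunique())
```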

### Step 2.3: Find out the counts for all the unique values in a categorical column

Find out the counts of unique ``item_name`` in the column.

__HINT__: Pandas provides a method ``.value_counts()`` for this purpose.

In [None]:
# .value_counts() displays the number of occurrences of each unique value
# in the specified column, sorted from most to least frequent
# (str.strip() does not help here because it only trims the ends of each string)

order_df['item_name'].value_counts()

Chicken Bowl                             726
Chicken Burrito                          553
Chips and Guacamole                      479
Steak Burrito                            368
Canned Soft Drink                        301
Steak Bowl                               211
Chips                                    211
Bottled Water                            162
Chicken Soft Tacos                       115
Chips and Fresh Tomato Salsa             110
Chicken Salad Bowl                       110
Canned Soda                              104
Side of Chips                            101
Veggie Burrito                            95
Barbacoa Burrito                          91
Veggie Bowl                               85
Carnitas Bowl                             68
Barbacoa Bowl                             66
Carnitas Burrito                          59
Steak Soft Tacos                          55
6 Pack Soft Drink                         54
Chips and Tomatillo Red Chili Salsa       48
Chicken Cr

### Step 2.4: Merging categories (unique values) in a categorical column

Transform the unique values in ``item_name`` into a new Pandas Series ``item_category`` by keeping the __top 9__ categories by _frequency_ and merging everything else as ``'other'``.

In [None]:
# create a Pandas Series ``item_counts`` to hold the value counts in ``item_name``

# item_counts = pd.Series.value_counts(order_df['item_name'])# no need to repeat pd call when calling two pandas methods in a row

#Simpler
item_counts = order_df['item_name'].value_counts()

item_counts

Chicken Bowl                             726
Chicken Burrito                          553
Chips and Guacamole                      479
Steak Burrito                            368
Canned Soft Drink                        301
Steak Bowl                               211
Chips                                    211
Bottled Water                            162
Chicken Soft Tacos                       115
Chips and Fresh Tomato Salsa             110
Chicken Salad Bowl                       110
Canned Soda                              104
Side of Chips                            101
Veggie Burrito                            95
Barbacoa Burrito                          91
Veggie Bowl                               85
Carnitas Bowl                             68
Barbacoa Bowl                             66
Carnitas Burrito                          59
Steak Soft Tacos                          55
6 Pack Soft Drink                         54
Chips and Tomatillo Red Chili Salsa       48
Chicken Cr

In [None]:
# find out the top 9 categories by frequency
# HINT: note that ``item_counts`` are sorted, so we just need the index of the first 9 elements 

# Take the index labels of the most frequent values.
# Caution: the slice [0:10] returns the first ten labels, one more than the nine
# requested; item_counts.index[:9] would return exactly nine.
top9 = item_counts[0:10].index
top9

Index(['Chicken Bowl', 'Chicken Burrito', 'Chips and Guacamole',
       'Steak Burrito', 'Canned Soft Drink', 'Steak Bowl', 'Chips',
       'Bottled Water', 'Chicken Soft Tacos', 'Chips and Fresh Tomato Salsa'],
      dtype='object')

In [None]:
# now it is time to replace the values in ``item_name``
# first retrieve the values in ``item_name``
item_names = order_df['item_name'] # Assigning item_name column to item_names variable

item_names.head()

0             Chips and Fresh Tomato Salsa
1                                     Izze
2                         Nantucket Nectar
3    Chips and Tomatillo-Green Chili Salsa
4                             Chicken Bowl
Name: item_name, dtype: object

Next we need to find out whether each value in ``item_names`` is in ``top9``.

Pandas provides a method ``.isin()`` for this purpose. Let's first see which values are in ``top9``.

In [None]:
# Creating boolean values for item_names based on whether each value is in top9 or not

item_names.isin(top9)



0        True
1       False
2       False
3       False
4        True
        ...  
4617     True
4618     True
4619    False
4620    False
4621    False
Name: item_name, Length: 4622, dtype: bool

From what we know about boolean indexing (masks), the result above shows which values are (``True``) or are not (``False``) in ``top9``.

Now we need the reverse: values in ``top9`` should be ``False`` (__not to be replaced__) and all other values should be ``True`` (__to be replaced__).

Pandas provides the ``~`` operator for this purpose. Observe the example below.

In [None]:
my_ser = pd.Series([False, True, False, True, True])
~my_ser

0     True
1    False
2     True
3    False
4    False
dtype: bool

In [None]:
# make a copy of ``item_names`` as ``item_category``
item_category = item_names.copy() # copy to not alter original data



In [None]:
item_category 

0                Chips and Fresh Tomato Salsa
1                                        Izze
2                            Nantucket Nectar
3       Chips and Tomatillo-Green Chili Salsa
4                                Chicken Bowl
                        ...                  
4617                            Steak Burrito
4618                            Steak Burrito
4619                       Chicken Salad Bowl
4620                       Chicken Salad Bowl
4621                       Chicken Salad Bowl
Name: item_name, Length: 4622, dtype: object

In [None]:
import numpy as np

# write your code to:
# reverse ``item_names.isin(top9)`` and use the result to index on ``item_names``
# for the values as ``True``, replace them as ``'Other'`` and keep the others unchanged


boolnames = item_names.isin(top9) # True where the item is one of the top categories

# ~boolnames is True for every item NOT in top9; overwrite those entries with 'Other'
item_category.loc[~boolnames] = 'Other'




In [None]:
item_category.name = 'item_category'




Next, merge ``item_category`` back to ``order_df`` as the last (rightmost) column.

__HINT__: you should use the `pd.concat()` function, and name the result ``order_df``.

In [None]:
# item_category is a named Series that shares order_df's index, so concatenating
# along axis=1 appends it as the rightmost column

# a direct assignment would give the same result
# order_df['item_category'] = item_category

order_df = pd.concat([order_df, item_category], axis=1)



order_df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,price_value,item_category
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,2.39,Chips and Fresh Tomato Salsa
1,1,1,Izze,[Clementine],$3.39,3.39,Other
2,1,1,Nantucket Nectar,[Apple],$3.39,3.39,Other
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,2.39,Other
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98,16.98,Chicken Bowl
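One detail worth keeping in mind with `pd.concat(..., axis=1)` is that rows are matched by index label, not by position. The sketch below uses hypothetical toy data to show both the aligned case and what happens when the labels do not line up.

```python
import pandas as pd

# Hypothetical data; both objects start with the default index 0, 1, 2.
left = pd.DataFrame({"item_name": ["Izze", "Chips", "Chicken Bowl"]})
category = pd.Series(["Other", "Chips", "Chicken Bowl"], name="item_category")

# Matching index labels: the Series becomes the rightmost column,
# and its name is used as the column name.
combined = pd.concat([left, category], axis=1)

# Shifted index labels (1, 2, 3): concat takes the union of labels
# and fills the gaps with missing values.
shifted = category.set_axis([1, 2, 3])
misaligned = pd.concat([left, shifted], axis=1)

print(combined)
print(misaligned)
```

This is why copying a column from the DataFrame and modifying the copy in place is convenient: the copy keeps the original index, so it lines up row for row when it is concatenated back.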


In [None]:
# let's check the value counts again to make sure it fits our purpose

# Checking the value counts after assigning 'Other' to the less frequent values


order_df['item_category'].value_counts()

Other                           1386
Chicken Bowl                     726
Chicken Burrito                  553
Chips and Guacamole              479
Steak Burrito                    368
Canned Soft Drink                301
Chips                            211
Steak Bowl                       211
Bottled Water                    162
Chicken Soft Tacos               115
Chips and Fresh Tomato Salsa     110
Name: item_category, dtype: int64
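The whole of Step 2.4 can also be condensed into a small reusable function. This is only a sketch: the helper `keep_top_n` and the toy data are hypothetical, and it uses `Series.where` in place of the mask-and-assign steps practiced above.

```python
import pandas as pd

def keep_top_n(values: pd.Series, n: int, other_label: str = "Other") -> pd.Series:
    """Keep the n most frequent values and replace the rest with other_label."""
    # value_counts() sorts by frequency, so the first n labels are the top n.
    top_labels = values.value_counts().index[:n]
    # where() keeps entries whose condition is True and replaces the others.
    return values.where(values.isin(top_labels), other_label)

# Hypothetical categorical data: 'a' appears 3 times, 'b' twice, 'c' and 'd' once.
toy_items = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
toy_category = keep_top_n(toy_items, n=2)

print(toy_category.value_counts())
```

Because `where` returns a new Series, the input is left unmodified and no explicit `.copy()` is needed.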

That is all for this assignment. Please submit this work when you are done.