# Objective
Dive deep into working with datasets using _pandas_.

# Things To Learn
* Creating and working with _pandas dataframes_ and _series_.
* Loading and getting an overview of data.
* Manipulating datasets.
* Selecting, grouping and sorting data.

# Submission Guidelines
* Your finished _Jupyter Notebook_ - both as `.ipynb` and exported `.pdf`.

# Task: Manually Creating A _Dataframe_ (Fabian Oppermann)

**Task:**
Import _pandas_ and manually create a _dataframe_ containing data about a domain **you're passionate about**. Your dataframe should...

* ...contain at least 4 rows.
* ...contain at least 3 columns, of which at least one should be numeric and one should be suitable as a _key_ (i.e. _index_).

Next, manually create a _series_ and add it as an additional column to your dataset.

Finally, set the index of your dataframe to a suitable column.

In [13]:
import pandas as pd

In [14]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.html
data = {
    'Species': ['Acerifolia', 'Platanus', 'Pinus pinea', 'Quercus'],
    'Health': ['Good', 'Fair', 'Good', 'Good'],
    'Health_Number': [10, 20, 15, 40]
}

species_df = pd.DataFrame(data)

# Manually creating a series and adding it as an additional column
conservation_status = pd.Series(['Vulnerable', 'Endangered', 'Vulnerable', 'Vulnerable'], name='Conservation_Status')
species_df['Conservation_Status'] = conservation_status

# Setting the index of the dataframe to 'Species'
species_df.set_index('Species', inplace=True)

print(species_df)

            Health  Health_Number Conservation_Status
Species                                              
Acerifolia    Good             10          Vulnerable
Platanus      Fair             20          Endangered
Pinus pinea   Good             15          Vulnerable
Quercus       Good             40          Vulnerable


# Task: Importing And Getting An Overview (Fabian Oppermann)

Read the _ramen_ ratings from the provided file and get an overview of the data:

* Take a look at some rows from the top, bottom or random positions...
* Use summary functions to get a statistical overview of the data.
* Find out what data types we're dealing with.
* Identify what column would be suitable as index and set it!

In [15]:
ramen_df = pd.read_csv('./ramen-ratings.csv')

# https://pandas.pydata.org/docs/reference/frame.html
print(ramen_df.head()) # Top
print('-' * 50)
print(ramen_df.tail()) # Bottom
print('-' * 50)
print(ramen_df.sample(5)) # Random
print('-' * 50)
print(ramen_df.describe()) # Summary
print('-' * 50)
ramen_df.info() # Info (prints its report directly and returns None)

   Review #           Brand  \
0      2580       New Touch   
1      2579        Just Way   
2      2578          Nissin   
3      2577         Wei Lih   
4      2576  Ching's Secret   

                                             Variety Style Country Stars  \
0                          T's Restaurant Tantanmen    Cup   Japan  3.75   
1  Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack  Taiwan     1   
2                      Cup Noodles Chicken Vegetable   Cup     USA  2.25   
3                      GGE Ramen Snack Tomato Flavor  Pack  Taiwan  2.75   
4                                    Singapore Curry  Pack   India  3.75   

  Top Ten  
0     NaN  
1     NaN  
2     NaN  
3     NaN  
4     NaN  
--------------------------------------------------
      Review #     Brand                                            Variety  \
2575         5     Vifon  Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty...   
2576         4   Wai Wai                     Oriental Style Instant Noodles
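
The last two bullets of this task (data types and a suitable index) can be sketched on a small stand-in frame. The frame `sample_df` below is hypothetical and only reuses a few values from the preview above. It treats `Review #` as the index candidate because it is unique per row.

```python
import pandas as pd

# Hypothetical miniature of the ramen data (values taken from the preview above)
sample_df = pd.DataFrame({
    'Review #': [2580, 2579, 2578],
    'Brand': ['New Touch', 'Just Way', 'Nissin'],
    'Stars': ['3.75', '1', '2.25'],
})

# dtypes lists one data type per column; 'Stars' holds text, not numbers
print(sample_df.dtypes)

# A column is a reasonable index candidate if its values are unique
print(sample_df['Review #'].is_unique)

# set_index returns a new frame; sample_df itself is left untouched
indexed_df = sample_df.set_index('Review #')
print(indexed_df.loc[2579, 'Brand'])
```

Working on a copy keeps `Review #` available as a regular column, which the later cells in this notebook rely on.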

# Task: Dealing With Erroneous Rows

As you surely noticed, the star ratings are not included in the numerical statistics. Find out why and fix this, so we can work with the column:

* Take a look at the column's data type (should have been determined in the last task). Try casting it to `float`!
* Find out all the unique values in the column.
* Now that you know the faulty values, identify their rows and drop them. This should affect 3 rows.
* Try casting the column again and make sure we can now calculate statistics on it by printing its mean.

In [16]:
# Try converting 'Stars' to numbers; unparsable values become NaN (the column itself is left unchanged)
stars_numeric = pd.to_numeric(ramen_df['Stars'], errors='coerce')

# Unique values
print(ramen_df['Stars'].unique())

['3.75' '1' '2.25' '2.75' '4.75' '4' '0.25' '2.5' '5' '4.25' '4.5' '3.5'
 'Unrated' '1.5' '3.25' '2' '0' '3' '0.5' '4.00' '5.0' '3.50' '3.8' '4.3'
 '2.3' '5.00' '3.3' '4.0' '3.00' '1.75' '3.0' '4.50' '0.75' '1.25' '1.1'
 '2.1' '0.9' '3.1' '4.125' '3.125' '2.125' '2.9' '0.1' '2.8' '3.7' '3.4'
 '3.6' '2.85' '3.2' '3.65' '1.8']
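
The remaining steps of this task (identify the faulty rows, drop them, cast again, print the mean) could look roughly like the sketch below. It runs on a hypothetical four-row frame `stars_df` rather than on the real file. The only faulty value it assumes is `'Unrated'`, which is the one visible in the output above.

```python
import pandas as pd

# Hypothetical sample; 'Unrated' stands in for the non-numeric entries
stars_df = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D'],
    'Stars': ['3.75', 'Unrated', '5', '2.25'],
})

# errors='coerce' turns anything unparsable into NaN instead of raising
numeric_stars = pd.to_numeric(stars_df['Stars'], errors='coerce')

# Rows whose original value could not be parsed
faulty_rows = stars_df[numeric_stars.isna()]
print(faulty_rows)

# Drop those rows, then cast the remaining values to float
clean_df = stars_df.drop(index=faulty_rows.index)
clean_df['Stars'] = clean_df['Stars'].astype(float)
print(clean_df['Stars'].mean())
```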


# Task: Missing Values (Fabian Oppermann)

We don't want to deal with missing values in this example. Do the following:

* Find out how many values are missing in each column.
* One column should be missing 2 values. Identify the relevant rows and delete them.
* One column should be missing a lot of values. Delete the whole column.

In [17]:
# Missing values
print(ramen_df.isnull().sum())

#?

Review #       0
Brand          0
Variety        0
Style          2
Country        0
Stars          0
Top Ten     2539
dtype: int64
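
The two deletion steps from the task list can be sketched as follows. The frame `missing_df` is a hypothetical stand-in with the same pattern of gaps as the counts above: a few missing `Style` values and a mostly empty `Top Ten` column.

```python
import pandas as pd

# Hypothetical sample with the same kind of gaps as the ramen data
missing_df = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D'],
    'Style': ['Pack', None, 'Cup', 'Bowl'],
    'Top Ten': [None, None, '2016 #1', None],
})

# Identify the rows where 'Style' is missing
print(missing_df[missing_df['Style'].isna()])

# Drop only those rows, then remove the sparse column entirely
cleaned_df = missing_df.dropna(subset=['Style']).drop(columns=['Top Ten'])
print(cleaned_df.isna().sum())
```

Passing `subset` to `dropna` matters here: without it, every row with a missing `Top Ten` entry would be dropped as well.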


# Task: Create A Price-Column (Fabian Oppermann)

Unfortunately the ramen reviews don't contain a price column. Let's _fake_ one with rough estimates:

* A _pack_ of ramen should cost 0.79.
* A _bowl_ of ramen should cost 1.79.
* A _cup_ of ramen should cost 1.29.
* A _tray_ of ramen should cost 2.19.
* All other types should cost 1.09.

Create a function calculating the price based on the style. Use `map` to create a series and add it as a new column to your dataframe.

In [18]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html

COST_PACK = 0.79
COST_BOWL = 1.79
COST_CUP = 1.29
COST_TRAY = 2.19
COST_OTHER = 1.09


def calculate_price(style):
    """Return the estimated price for a given ramen style."""
    if style == 'Pack':
        return COST_PACK
    elif style == 'Bowl':
        return COST_BOWL
    elif style == 'Cup':
        return COST_CUP
    elif style == 'Tray':
        return COST_TRAY
    else:
        return COST_OTHER


ramen_df['Price'] = ramen_df['Style'].map(calculate_price)

print(ramen_df.head())

   Review #           Brand  \
0      2580       New Touch   
1      2579        Just Way   
2      2578          Nissin   
3      2577         Wei Lih   
4      2576  Ching's Secret   

                                             Variety Style Country Stars  \
0                          T's Restaurant Tantanmen    Cup   Japan  3.75   
1  Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack  Taiwan     1   
2                      Cup Noodles Chicken Vegetable   Cup     USA  2.25   
3                      GGE Ramen Snack Tomato Flavor  Pack  Taiwan  2.75   
4                                    Singapore Curry  Pack   India  3.75   

  Top Ten  Price  
0     NaN   1.29  
1     NaN   0.79  
2     NaN   1.29  
3     NaN   0.79  
4     NaN   0.79  
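
As an alternative to a function, `Series.map` also accepts a dictionary. The sketch below is not what the task asks for, only a compact variant. The names `STYLE_PRICES`, `DEFAULT_PRICE` and `styles` are made up for illustration.

```python
import pandas as pd

# Price table from the task description; names are hypothetical
STYLE_PRICES = {'Pack': 0.79, 'Bowl': 1.79, 'Cup': 1.29, 'Tray': 2.19}
DEFAULT_PRICE = 1.09

styles = pd.Series(['Cup', 'Pack', 'Box', None, 'Tray'])

# Mapping with a dict leaves unmatched and missing values as NaN;
# fillna then assigns the fallback price to them
prices = styles.map(STYLE_PRICES).fillna(DEFAULT_PRICE)
print(prices.tolist())
```

Both variants give missing styles the fallback price, since the function version ends up in its `else` branch for them.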


# Task: From Stars To Points (Fabian Oppermann)

Let's switch from a _star rating_ to a _point rating_ between 0 and 100.

* Calculate the points based on the stars, where 5 stars equal 100 points.
* Change the column name from _Stars_ to _Points_.

In [19]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html#pandas.DataFrame.rename

ramen_df = ramen_df.rename(columns={'Stars': 'Points'})
ramen_df['Points'] = ramen_df['Points'].apply(lambda x: float(x) if x != 'Unrated' else None) * 10
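
Read literally, the task's mapping of 5 stars to 100 points gives 100 / 5 = 20 points per star. A minimal sketch of that arithmetic on a hypothetical frame `ratings_df` follows. It uses `pd.to_numeric` with `errors='coerce'` as one possible way to handle non-numeric entries.

```python
import pandas as pd

# Hypothetical sample of star ratings stored as text
ratings_df = pd.DataFrame({'Stars': ['5', '3.75', 'Unrated', '0']})

# 5 stars -> 100 points implies 100 / 5 = 20 points per star
POINTS_PER_STAR = 100 / 5

# Unparsable entries become NaN and stay NaN after scaling
ratings_df['Stars'] = pd.to_numeric(ratings_df['Stars'], errors='coerce') * POINTS_PER_STAR
ratings_df = ratings_df.rename(columns={'Stars': 'Points'})
print(ratings_df['Points'].tolist())
```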

# Task: Create A Recommendation-Column (Fabian Oppermann)

Let's create a new column containing a textual recommendation:

* Ramen with 90 points or more get either _Amazing value!_ (for prices beneath 1.3) or _Expensive but delicious!_.
* Other ramen with ratings above 80 should read _Must-Try!_.
* Other cheap ramen with a price below 1 should read _Budget choice!_.
* All other ramen should read _Why not?_.

Since you need multiple columns to calculate the recommendation, you need to use `apply`. Check how often each recommendation text appears afterwards!

In [20]:
ramen_df['Textual_Recom'] = ramen_df.apply(
    lambda row: 'Amazing value!' if float(row['Points']) >= 90 and float(row['Price']) < 1.3 else
                'Expensive but delicious!' if float(row['Points']) >= 90 else
                'Must-Try!' if float(row['Points']) > 80 else
                'Budget choice!' if float(row['Price']) < 1 else
                'Why not?', axis=1)

print(ramen_df['Textual_Recom'].value_counts())

Textual_Recom
Budget choice!    1531
Why not?          1049
Name: count, dtype: int64
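
The same rules can also be written as a named function, which some readers find easier to follow than a chained conditional expression. This is a sketch on a hypothetical frame `demo_df` with one row per rule; `recommend` is an illustrative name.

```python
import pandas as pd


def recommend(row):
    """Return the recommendation text for one row with 'Points' and 'Price'."""
    if row['Points'] >= 90:
        return 'Amazing value!' if row['Price'] < 1.3 else 'Expensive but delicious!'
    if row['Points'] > 80:
        return 'Must-Try!'
    if row['Price'] < 1:
        return 'Budget choice!'
    return 'Why not?'


# Hypothetical rows, chosen so that every rule is hit once
demo_df = pd.DataFrame({
    'Points': [95.0, 95.0, 85.0, 50.0, 50.0],
    'Price': [0.79, 1.79, 1.79, 0.79, 1.79],
})

# axis=1 passes each row to the function, so several columns are available
demo_df['Textual_Recom'] = demo_df.apply(recommend, axis=1)
print(demo_df['Textual_Recom'].value_counts())
```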


# Task: Export Data

At this point it makes sense to back up our processed dataframe. Export the data to a file `ramen_processed.csv` in a subfolder `output` and compare your results to the provided solution.

In [21]:
import os

# Create the output directory if it does not exist
os.makedirs('output', exist_ok=True)

ramen_df.to_csv('output/ramen_processed.csv', index=False)

# Task: Selecting Data (_Integer-Based_) (Fabian Oppermann)

You can now either use the dataset you've worked on so far or - if in doubt - load our `ramen_processed_solution.csv` for the next tasks.

Solve the following tasks to sharpen your selecting skills:

* Get the first two columns of the last ten rows.
* Get the 4<sup>th</sup> column of the 15<sup>th</sup> row.
* Get the second and the last column of the second last ten rows.
* Get everything but the first column of the 20<sup>th</sup>, 30<sup>th</sup>, 40<sup>th</sup> and 50<sup>th</sup> row.
* Get every column of the 100<sup>th</sup> up to (and including) the 200<sup>th</sup> row.
* Get the third column of every row.

In [22]:
# Get the first two columns of the last ten rows
print(ramen_df.iloc[-10:, :2])
print('-' * 50)

# Get the 4th column of the 15th row
print(ramen_df.iloc[14, 3])
print('-' * 50)

# Get the second and the last column of the second last ten rows
print(ramen_df.iloc[-20:-10, [1, -1]])
print('-' * 50)

# Get everything but the first column of the 20th, 30th, 40th and 50th row
print(ramen_df.iloc[[19, 29, 39, 49], 1:])
print('-' * 50)

# Get every column of the 100th up to (and including) the 200th row
print(ramen_df.iloc[99:200, :])
print('-' * 50)

# Get the third column of every row
print(ramen_df.iloc[:, 2])
print('-' * 50)

      Review #     Brand
2570        10     Smack
2571         9     Sutah
2572         8    Tung-I
2573         7   Ve Wong
2574         6     Vifon
2575         5     Vifon
2576         4   Wai Wai
2577         3   Wai Wai
2578         2   Wai Wai
2579         1  Westbrae
--------------------------------------------------
Pack
--------------------------------------------------
                Brand   Textual_Recom
2560         Nongshim  Budget choice!
2561         Nongshim  Budget choice!
2562           Ottogi  Budget choice!
2563        Quickchow  Budget choice!
2564          Samyang  Budget choice!
2565          Samyang  Budget choice!
2566          Samyang  Budget choice!
2567  Sapporo Ichiban  Budget choice!
2568  Sapporo Ichiban  Budget choice!
2569      Six Fortune  Budget choice!
--------------------------------------------------
            Brand                                            Variety Style  \
19          Paldo                                    Premium Gomtang  P

# Task: Selecting Data (_Label-Based_) (Fabian Oppermann)

Solve the following tasks to sharpen your selecting skills:

* Get the variety and textual recommendation for the review _#1235_.
* Get all columns but the textual recommendation for the reviews from (and including) _#5_ to (and including) _#10_.
* Get the style and points for the reviews _#123_, _#234_ and _#345_.
* Get the variety, style and country for all reviews.

In [23]:
# Get the variety and textual recommendation for the review #1235
print(ramen_df.loc[ramen_df['Review #'] == 1235, ['Variety', 'Textual_Recom']])
print('-' * 50)

# Get all columns but the textual recommendation for the reviews from #5 to #10
print(ramen_df.loc[(ramen_df['Review #'] >= 5) & (ramen_df['Review #'] <= 10)].drop(columns=['Textual_Recom']))
print('-' * 50)

# Get the style and points for the reviews #123, #234 and #345
print(ramen_df.loc[ramen_df['Review #'].isin([123, 234, 345]), ['Style', 'Points']])
print('-' * 50)

# Get the variety, style and country for all reviews
print(ramen_df[['Variety', 'Style', 'Country']])
print('-' * 50)

                                                Variety   Textual_Recom
1345  Demae Ramen Miso Tonkotsu Artificial Pork Flav...  Budget choice!
--------------------------------------------------
      Review #    Brand                                            Variety  \
2570        10    Smack                                     Vegetable Beef   
2571         9    Sutah                                         Cup Noodle   
2572         8   Tung-I                   Chinese Beef Instant Rice Noodle   
2573         7  Ve Wong                                      Mushroom Pork   
2574         6    Vifon                                           Nam Vang   
2575         5    Vifon  Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty...   

     Style      Country  Points Top Ten  Price  
2570  Pack          USA    15.0     NaN   0.79  
2571   Cup  South Korea    20.0     NaN   1.29  
2572  Pack       Taiwan    30.0     NaN   0.79  
2573  Pack      Vietnam    10.0     NaN   0.79  
2574  Pack   
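
If the review number is set as the index, `loc` can address rows directly by label instead of through boolean masks. A minimal sketch, assuming a hypothetical frame `reviews_df` indexed by `Review #`:

```python
import pandas as pd

# Hypothetical sample, indexed by the review number
reviews_df = pd.DataFrame({
    'Review #': [10, 9, 8, 7, 6, 5],
    'Brand': ['Smack', 'Sutah', 'Tung-I', 'Ve Wong', 'Vifon', 'Vifon'],
    'Style': ['Pack', 'Cup', 'Pack', 'Pack', 'Pack', 'Bowl'],
}).set_index('Review #')

# A single label plus a list of columns
print(reviews_df.loc[7, ['Brand', 'Style']])

# Label slices include both endpoints; the index is descending here,
# so the slice runs from 10 down to 7
print(reviews_df.loc[10:7, 'Brand'])

# A list of labels selects exactly those rows
print(reviews_df.loc[[9, 6], 'Style'])
```

Unlike `iloc`, a `loc` slice includes its end label, which matches the "from (and including) ... to (and including)" wording of the task.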

# Task: Conditional Selection (Fabian Oppermann)

Solve the following tasks to sharpen your selection skills:

* Get everything about ramen that comes in a bowl.
* Get variety, style and points of all ramen that comes from Germany.
* Get the columns from brand up to country of all ramen that comes in a cup and has a rating lower than 10 points.
* Get brand, variety and points of all ramen either produced by _Samyang_ or having a rating over 95 points.
* Get everything but price and recommendation for all ramen containing _Hello Kitty_ in their variety field.
* Get everything up to country for all ramen from the brands _Knorr_, _Vifon_ and _Yum Yum_.

In [24]:
# Get everything about ramen that comes in a bowl
print(ramen_df[ramen_df['Style'] == 'Bowl'])
print('-' * 50)

# Get variety, style and points of all ramen that comes from Germany
print(ramen_df[ramen_df['Country'] == 'Germany'][['Variety', 'Style', 'Points']])
print('-' * 50)

# Get the columns from brand up to country of all ramen that comes in a cup and has a rating lower than 10 points
print(ramen_df[(ramen_df['Style'] == 'Cup') & (ramen_df['Points'] < 10)].loc[:, 'Brand':'Country'])
print('-' * 50)

# Get brand, variety and points of all ramen either produced by Samyang or having a rating over 95 points
print(ramen_df[(ramen_df['Brand'] == 'Samyang') | (ramen_df['Points'] > 95)][['Brand', 'Variety', 'Points']])
print('-' * 50)

# Get everything but price and recommendation for all ramen containing Hello Kitty in their variety field
print(ramen_df[ramen_df['Variety'].str.contains('Hello Kitty')].drop(columns=['Price', 'Textual_Recom']))
print('-' * 50)

# Get everything up to country for all ramen from the brands Knorr, Vifon and Yum Yum
print(ramen_df[ramen_df['Brand'].isin(['Knorr', 'Vifon', 'Yum Yum'])].loc[:, :'Country'])
print('-' * 50)

      Review #             Brand  \
13        2567            Nissin   
25        2555     Samyang Foods   
27        2553            Nissin   
28        2552           MyKuali   
33        2547  Sichuan Guangyou   
...        ...               ...   
2537        43       Kim Ve Wong   
2539        41       Little Cook   
2541        39         Lucky Me!   
2551        29          Mee Jang   
2575         5             Vifon   

                                                Variety Style      Country  \
13                         Deka Buto Kimchi Pork Flavor  Bowl        Japan   
25                            Song Song Kimchi Big Bowl  Bowl  South Korea   
27                   Hakata Ramen Noodle White Tonkotsu  Bowl        Japan   
28              Penang White Curry Rice Vermicelli Soup  Bowl     Malaysia   
33                          Chongqing Spicy Hot Noodles  Bowl        China   
...                                                 ...   ...          ...   
2537          Jaopai 

# Task: Grouping And Sorting (Fabian Oppermann)

Use grouping to solve the following tasks:

* Print out the number of reviews per brand in ascending order.
* Print out all the styles and their mean rating points - in descending order.
* Print out the minimum, maximum and mean price per brand, sorted by the maximum values.
* Print out all the values of the highest rated ramen per style!
* Print out the count and mean of ratings per style per brand, sorted by the count.

In [25]:
# Print out the number of reviews per brand in ascending order
print(ramen_df['Brand'].value_counts().sort_values())
print('-' * 50)

# Print out all the styles and their mean rating points - in descending order
print(ramen_df.groupby('Style')['Points'].mean().sort_values(ascending=False))
print('-' * 50)

# Print out the minimum, maximum and mean price per brand, sorted by the maximum values
print(ramen_df.groupby('Brand')['Price'].agg(['min', 'max', 'mean']).sort_values(by='max', ascending=False))
print('-' * 50)

# Print out all the values of the highest rated ramen per style
print(ramen_df.loc[ramen_df.groupby('Style')['Points'].idxmax()])
print('-' * 50)

# Print out the count and mean of ratings per style per brand, sorted by the count
print(ramen_df.groupby(['Brand', 'Style'])['Points'].agg(['count', 'mean']).sort_values(by='count', ascending=False))
print('-' * 50)

Brand
Westbrae              1
Dongwon               1
S&S                   1
Yum-Mie               1
Jackpot Teriyaki      1
                   ... 
Paldo                66
Mama                 71
Maruchan             76
Nongshim             98
Nissin              381
Name: count, Length: 355, dtype: int64
--------------------------------------------------
Style
Bar     50.000000
Box     42.916667
Pack    37.004581
Bowl    36.706861
Tray    35.451389
Can     35.000000
Cup     34.985000
Name: Points, dtype: float64
--------------------------------------------------
               min   max      mean
Brand                             
Vina Acecook  0.79  2.19  0.963529
Bon Go Jang   1.79  2.19  1.990000
Thai Kitchen  0.79  2.19  1.370000
Doll          0.79  2.19  0.940000
Daikoku       1.29  2.19  1.690000
...            ...   ...       ...
Weh Lih       0.79  0.79  0.790000
Wei Chuan     0.79  0.79  0.790000
Zow Zow       0.79  0.79  0.790000
iMee          0.79  0.79  0.790000
iNoodle 
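
For the "highest rated ramen per style" bullet, it helps to separate the maximum value of a group from the row that holds it. A small sketch on a hypothetical frame `scores_df`:

```python
import pandas as pd

# Hypothetical sample with two styles
scores_df = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D'],
    'Style': ['Cup', 'Cup', 'Pack', 'Pack'],
    'Points': [30.0, 45.0, 50.0, 20.0],
})

# max() returns the highest value per group ...
print(scores_df.groupby('Style')['Points'].max())

# ... while idxmax() returns the index label of the row holding that value,
# which loc can then use to fetch the complete rows
best_rows = scores_df.loc[scores_df.groupby('Style')['Points'].idxmax()]
print(best_rows)
```

When several rows in a group share the maximum, `idxmax` returns only the first of them.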

# Good Job!