# Pandas data manipulation

In [None]:
import pandas as pd

## Import new dataset and explore it

In [None]:
cars = pd.read_csv('automobile_data.csv', sep=',', index_col=0)

In [None]:
cars

In [None]:
cars['company']

In [None]:
cars['company'].unique()

In [None]:
# Tidy up the dataframe: the index is not continuous (61 rows, but the index goes up to 88)
cars.reset_index(inplace=True, drop=True)
cars
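To see what `reset_index(drop=True)` does in isolation, here is a minimal sketch on a made-up dataframe. The name `toy` and its values are illustrative assumptions, not taken from `automobile_data.csv`.

```python
import pandas as pd

# Hypothetical toy data with gaps in the index, mimicking the situation above
toy = pd.DataFrame({'company': ['audi', 'bmw', 'volvo']}, index=[0, 3, 88])

# drop=True discards the old index instead of keeping it as an extra column
toy_reset = toy.reset_index(drop=True)
print(toy_reset.index.tolist())  # [0, 1, 2]
```

Without `drop=True`, the old labels would be kept in a new column called `index`.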

## See first and last rows of the dataframe

In [None]:
cars.head(1)

In [None]:
cars.tail(1)

### First exercise: show the first five and the last five entries in the dataframe (in separate commands)

## Select rows based on row number

In [None]:
cars.iloc[1]

In [None]:
cars.iloc[1:3]

In [None]:
cars.iloc[[1,5,8,42]]

In [None]:
cars.iloc[[1,5,8,42]]['horsepower']  # even more "chopping"

Do you miss **booleans** (True or False)? I miss booleans, so let's do booleans. Let's find out which of the rows in this subset of cars has the highest price.

In [None]:
subsection_cars = cars.iloc[[1,5,8,42]]
subsection_cars.price == subsection_cars['price'].max()

In [None]:
subsection_cars[subsection_cars.price == subsection_cars['price'].max()]
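The two cells above can be read as two separate steps: build a boolean mask, then use it to filter. A small sketch on made-up data (the `toy` dataframe and its prices are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data, not taken from the cars dataset
toy = pd.DataFrame({'company': ['audi', 'bmw', 'volvo'],
                    'price': [13950.0, 41315.0, 12940.0]})

# Step 1: comparing a column with a value gives one True/False per row
mask = toy['price'] == toy['price'].max()
print(mask.tolist())  # [False, True, False]

# Step 2: indexing with the mask keeps only the rows where it is True
print(toy[mask])
```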

Let's select all the cars from the company bmw

In [None]:
cars[cars['company']=='bmw']

Now all the cars with horsepower above 120

In [None]:
cars[cars['horsepower']>120]

### Second exercise: find the most expensive car in the whole cars dataframe

Now print only the company name and the price

Let's count how many cars of each body-style are in the dataframe

In [None]:
cars['body-style'].value_counts()

### Third exercise: Count how many cars of each company are present in the dataframe

## Data per company

In [None]:
car_companies = cars.groupby('company')  # This creates a pandas GroupBy object whose contents are not displayed directly
car_companies

We can see that we have an iterable object that groups all the entries by the value in the 'company' column. We normally shouldn't operate on it with a for loop, but we'll use one here just to illustrate that all the data is there, even though it isn't visible at the moment.

In [None]:
for company in car_companies:
    print(company)
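Each item produced by that loop is a pair: the group name and the sub-dataframe holding its rows. A minimal sketch that unpacks the pair, using a made-up dataframe (`toy`, `toy_groups` and `group_sizes` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, not taken from the cars dataset
toy = pd.DataFrame({'company': ['audi', 'bmw', 'audi'],
                    'price': [13950.0, 41315.0, 17450.0]})

toy_groups = toy.groupby('company')

# Each iteration yields a (group name, sub-dataframe) pair
for name, group in toy_groups:
    print(name, len(group))

# Collect the number of rows per group, just to make the structure visible
group_sizes = {name: len(group) for name, group in toy_groups}
```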

Now we want to know the maximum car price per company. To do that we'll use this grouped object and ask for its maximum values:

In [None]:
car_companies['price'].max()

In [None]:
car_companies['price'].min()

### Fourth exercise: Put both the maximum and minimum horsepower of each brand in a dataframe with the columns 'max_hp' and 'min_hp' respectively. This can be done in one line if you feel like facing a challenge. 

## Sort a dataframe based on the values of a column

Dataframes can be sorted based on the data they contain. Let's sort the cars dataframe by horsepower:

In [None]:
cars.sort_values(by='horsepower', ascending=True)
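Note that `sort_values` returns a new dataframe and, by default, leaves the original untouched; the rows also carry their original index labels with them. A small sketch on made-up values (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, not taken from the cars dataset
toy = pd.DataFrame({'horsepower': [154, 102, 111]})

sorted_toy = toy.sort_values(by='horsepower', ascending=True)

print(sorted_toy['horsepower'].tolist())  # [102, 111, 154]
print(sorted_toy.index.tolist())          # [1, 2, 0], labels follow their rows
print(toy['horsepower'].tolist())         # [154, 102, 111], original unchanged
```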

### Fifth exercise: sort the dataframe from most expensive to cheapest, then save it into a new variable and reset the index (dropping the original index).

### Sixth exercise: Now remove all the rows for which the price is not a number (NaN)

Tip: check the `.notna()` method and remember how we normally subset dataframes. For reference, `.isna()` does the opposite.
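As a hint, this is roughly what the two methods return on a standalone Series. The values in `prices` are made up for illustration; applying the idea to `cars` is left for the exercise.

```python
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.Series([13950.0, float('nan'), 41315.0])

has_price = prices.notna()      # True where a value is present
missing_price = prices.isna()   # True where the value is NaN

print(has_price.tolist())       # [True, False, True]
print(missing_price.tolist())   # [False, True, False]
```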

## Now let's add another entry to our original dataframe `cars`

In [None]:
new_entry = pd.DataFrame(
    data=[['opel', 'sedan', 100.0, 200.0, 'ohc', 'six', 100, 30, 25050.0]],
    columns=cars.columns,
)
new_entry

In [None]:
cars_with_new_entries = pd.concat([cars, new_entry])
cars_with_new_entries
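One detail worth noticing: `pd.concat` keeps the index labels of each input, so the appended row arrives with its own label `0`. A minimal sketch on made-up dataframes (`top` and `bottom` are hypothetical names) showing the default behaviour next to `ignore_index=True`:

```python
import pandas as pd

# Hypothetical toy data, not taken from the cars dataset
top = pd.DataFrame({'company': ['audi', 'bmw']})
bottom = pd.DataFrame({'company': ['opel']})

# Default: every row keeps the label it had in its own dataframe
kept = pd.concat([top, bottom])
print(kept.index.tolist())        # [0, 1, 0]

# ignore_index=True builds a fresh 0..n-1 index for the result
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```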

### Seventh exercise: concatenate the dataframes provided below.

In [None]:
dfa = pd.DataFrame({'first_column':[1,2,3,4,5], 'second_colum':['A','B','C','D','E']})
dfa

In [None]:
dfb = pd.DataFrame({'first_column':[6,7], 'second_colum':['F', 'G']})
dfb

## Merge dataframes

In [None]:
dfa = pd.DataFrame({'first_column':[1,2,3,4,5], 'second_colum':['A','B','C','D','E']})
dfa

In [None]:
extra_info_df = pd.DataFrame({'first_column':[1,4,3,6], 'new_column': ['ham', 'eggs', 'spam', 'bacon']})
extra_info_df

In [None]:
merged_df = pd.merge(dfa, extra_info_df, on='first_column', how='left')  # Keeps every row of dfa; rows without a match get NaN in new_column
merged_df

In [None]:
merged_df = pd.merge(dfa, extra_info_df, on='first_column', how='right')  # Keeps every row of extra_info_df; rows without a match get NaN in second_colum
merged_df

In [None]:
merged_df = pd.merge(dfa,extra_info_df, on='first_column', how='inner')  # Only keeps entries that appear both in dfa and extra_info_df
merged_df

In [None]:
merged_df = pd.merge(dfa, extra_info_df, on='first_column', how='outer')  # Keeps all entries, even those that appear in only one of the dataframes
merged_df
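To check which side each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch on made-up dataframes (`left`, `right` and the `key` column are hypothetical names, chosen only for this illustration):

```python
import pandas as pd

# Hypothetical toy data with partially overlapping keys
left = pd.DataFrame({'key': [1, 2, 3], 'letter': ['A', 'B', 'C']})
right = pd.DataFrame({'key': [1, 3, 6], 'food': ['ham', 'spam', 'bacon']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
tracked = pd.merge(left, right, on='key', how='outer', indicator=True)
print(tracked)

# Map each key to its origin to make the result easy to inspect
origin = dict(zip(tracked['key'], tracked['_merge'].astype(str)))
```

Here keys 1 and 3 appear in both dataframes, key 2 only in `left` and key 6 only in `right`.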