# Lecture 9 - Pandas

[Pandas](https://pandas.pydata.org/docs/) is an essential package for working with data: it allows you to read and write your data, clean it, transform it and analyse it. It is built _on top_ of NumPy. In today's lecture, we will cover some basic functionality of [Pandas](https://pandas.pydata.org/docs/), and in the workshop we will apply it to a dataset.

We will cover the following topics:
- [Pandas DataFrames and Series](#Pandas-DataFrames-and-Series)
    - [Basic information about the data](#Basic-information-about-the-data)
    - [Selecting parts of the DataFrame](#Selecting-parts-of-the-DataFrame)
    - [Manipulating the columns](#Manipulating-the-columns)
- [Analysing the dataset](#Analysing-the-dataset)
- [Missing values](#Missing-values)
    - [Removing rows with missing values](#Removing-rows-with-missing-values)
    - [Imputation](#Imputation)
- [Reading from and writing into a file](#Reading-from-and-writing-into-a-file)

Just like NumPy, [Pandas](https://pandas.pydata.org/docs/) is an **external library**. This means you need to `import` the `pandas` module just like you did with `numpy`. In fact, let's `import` both of these modules for today's lecture:

In [11]:
import pandas as pd
import numpy as np

Just like with NumPy, you could use any alias other than `pd`, but `pd` is standard.

## Pandas DataFrames and Series

While NumPy was centered around _arrays_ which could be created from Python _lists_, Pandas is centered around [_DataFrames_](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and [_Series_](https://pandas.pydata.org/docs/reference/api/pandas.Series.html).

A [_DataFrame_](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) can be created from a Python _dictionary_. Specifically, this should be a dictionary associating _strings_ to _lists_, and the lists should contain _the same number of elements_. When a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) is created:
- you should think of each _key_ in the dictionary as a _feature_ ("characteristic") of a sample (which has been measured/recorded)
- you should think of each _element_ in the list as the value of that _feature_ for a specific sample

Let us create a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with three features (`'cats_owned'`, `'dogs_owned'`, `'parrots_owned'`) and 6 samples.

In [12]:
pets_dict = {
    'cats_owned':    [0, 1, 3, 1, 0, 0],
    'dogs_owned':    [2, 1, 0, 0, 0, 0],
    'parrots_owned': [0, 1, 0, 0, 1, 3]
}

pets_df = pd.DataFrame(pets_dict)

display(pets_df)

Unnamed: 0,cats_owned,dogs_owned,parrots_owned
0,0,2,0
1,1,1,1
2,3,0,0
3,1,0,0
4,0,0,1
5,0,0,3


In the above example, sample number $3$ has the value $(1, 0, 0)$, interpreted as **this person has one cat, no dogs and no parrots**.

In the previous example, the samples in the data frame were indexed automatically, by numbering them from zero. However, we can also add an index ("name the people in the dataset") or an ID to every sample.

This can be done in two ways:
- by changing [`DataFrame.index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html) if the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) has already been created
- by passing the value of `index` when we are creating the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

Let us name the 6 people in our dataset _Jack, Mary, Bob, Kay, Tim, Jane_, and add this information to the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):

In [13]:
pets_df.index = ['Jack', 'Mary', 'Bob', 'Kay', 'Tim', 'Jane']
pets_df

Unnamed: 0,cats_owned,dogs_owned,parrots_owned
Jack,0,2,0
Mary,1,1,1
Bob,3,0,0
Kay,1,0,0
Tim,0,0,1
Jane,0,0,3


Let us instead pass the `index` containing the names _Jack, Mary, Bob, Kay, Tim, Jane_ as an argument while creating the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):

In [14]:
users = ['Jack', 'Mary', 'Bob', 'Kay', 'Tim', 'Jane']

pets_df = pd.DataFrame(pets_dict, index = users)
pets_df

Unnamed: 0,cats_owned,dogs_owned,parrots_owned
Jack,0,2,0
Mary,1,1,1
Bob,3,0,0
Kay,1,0,0
Tim,0,0,1
Jane,0,0,3


### Basic information about the data

Similar to the NumPy arrays on which they are based, Pandas [`DataFrame`s](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) have the attribute [`DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html), which allows you to see the size of the underlying data. As in NumPy, the first element of [`DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) refers to the number of rows ("samples"), and the second element to the number of columns ("features").

In [15]:
pets_df.shape

(6, 3)

To get some more details about the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), we can also use [`DataFrame.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)

In [16]:
pets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Jack to Jane
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   cats_owned     6 non-null      int64
 1   dogs_owned     6 non-null      int64
 2   parrots_owned  6 non-null      int64
dtypes: int64(3)
memory usage: 192.0+ bytes


### Selecting parts of the DataFrame

Pandas actually has a more granular structure than a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) -- the Pandas [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). A [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) is simply a single column (or row; but not a whole table) from a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). There are several ways to select a column from a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), but the simplest is by name:

In [17]:
pets_df['dogs_owned']

Jack    2
Mary    1
Bob     0
Kay     0
Tim     0
Jane    0
Name: dogs_owned, dtype: int64

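There's a convenient Pandas syntax for selecting columns: if the column name is a _valid variable name_ (for example, it contains no spaces) and does not clash with an existing `DataFrame` attribute or method, you can access it as an attribute instead of using square brackets: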

In [18]:
pets_df.cats_owned

Jack    0
Mary    1
Bob     3
Kay     1
Tim     0
Jane    0
Name: cats_owned, dtype: int64
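
Attribute access has limits worth seeing once. The following is a minimal sketch on a made-up frame (`demo_df` and its column names are illustrative, not part of the lecture data): a column called `count` clashes with the `DataFrame.count()` method, and a column name with a space cannot be written as an attribute at all, so square brackets are the safe choice in both cases.

```python
import pandas as pd

# Hypothetical example data, chosen to trigger both problems
demo_df = pd.DataFrame({
    'count': [3, 1, 2],        # same name as the DataFrame.count() method
    'pets owned': [1, 0, 2],   # contains a space
})

by_brackets = demo_df['count']   # the column, as a Series
by_attribute = demo_df.count     # the method, NOT the column

print(type(by_brackets).__name__)
print(callable(by_attribute))

# A name with a space can only be reached with square brackets
spaced = demo_df['pets owned']
print(spaced.tolist())
```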

To select a row instead of a column, we can use [`DataFrame.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html):

In [19]:
pets_df.loc['Mary']

cats_owned       1
dogs_owned       1
parrots_owned    1
Name: Mary, dtype: int64

[`DataFrame.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) is actually more versatile and can be used to select columns as well (with a similar syntax to NumPy arrays).

In this example, the rows are selected by slicing on the index labels and the columns by a list of names. _Note that, unlike ordinary Python slicing, both the first and the last label of the slice are included_.

In [20]:
pets_df.loc['Mary':'Tim', ['dogs_owned', 'cats_owned']]
#pets_df.dogs_owned

Unnamed: 0,dogs_owned,cats_owned
Mary,1,1
Bob,0,3
Kay,0,1
Tim,0,0


Finally, [`DataFrame.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) also allows us to select by truth value (a list of boolean values interpreted as a mask). The length of the list must equal the number of columns in the dataframe (or the number of rows, if selecting rows):

In [40]:
#pets_df
#pets_df.loc[[True, False, True, True, False, False]]
pets_df.loc[:, [True, False, True]]

Unnamed: 0,cats_owned,parrots_owned
Jack,0,0
Mary,1,1
Bob,3,0
Kay,1,0
Tim,0,1
Jane,0,3


This allows us to select samples from the dataset based on certain criteria, just like in NumPy. For example, we can select all the people who own at least one dog and look at how many cats they own:

In [22]:
pets_df
pets_df.loc[pets_df.dogs_owned >= 1].cats_owned

Jack    0
Mary    1
Name: cats_owned, dtype: int64
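
Several criteria can be combined into one mask. The sketch below rebuilds the pets table under the illustrative name `demo_df` and combines two boolean Series with `&` (and), `|` (or) and `~` (not). Python's `and`/`or` keywords do not work element-wise on a Series, so the operators are needed here.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        'cats_owned':    [0, 1, 3, 1, 0, 0],
        'dogs_owned':    [2, 1, 0, 0, 0, 0],
        'parrots_owned': [0, 1, 0, 0, 1, 3],
    },
    index=['Jack', 'Mary', 'Bob', 'Kay', 'Tim', 'Jane'],
)

# Each comparison produces a boolean Series with one value per row
has_cat = demo_df['cats_owned'] >= 1
has_dog = demo_df['dogs_owned'] >= 1

cat_and_dog = demo_df.loc[has_cat & has_dog]   # owns both
cat_or_dog = demo_df.loc[has_cat | has_dog]    # owns at least one of the two
no_cat = demo_df.loc[~has_cat]                 # owns no cats

print(list(cat_and_dog.index))
print(list(cat_or_dog.index))
print(list(no_cat.index))
```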

If we, for some reason, want to use the integer positions of the rows and columns (like in `numpy`), we can use [`DataFrame.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) instead. Here, as in ordinary Python slicing, the end of the slice is excluded:

In [23]:
pets_df
pets_df.iloc[2:-1, 1:]

Unnamed: 0,dogs_owned,parrots_owned
Bob,0,0
Kay,0,0
Tim,0,1
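
The difference between the two kinds of slicing is easy to miss, so here is a small side-by-side sketch on an illustrative frame (`demo_df` is a made-up name): a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'cats': [0, 1, 3, 1], 'dogs': [2, 1, 0, 0]},
    index=['Jack', 'Mary', 'Bob', 'Kay'],
)

# Label-based slice: both 'Mary' and 'Bob' are included
by_label = demo_df.loc['Mary':'Bob']

# Position-based slice: position 1 is included, position 2 is not
by_position = demo_df.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```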


### Manipulating the columns

We can access the column ("feature") names and manipulate them.

Renaming all the columns can be done with [`DataFrame.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).

A lot of Pandas operations have the `inplace` argument:
- setting `inplace = False` (default) returns a _copy_ of the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (which you can assign to a new variable).
- setting `inplace = True` (must be done by hand) changes the current [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) on which you are working

In [25]:
print(pets_df.columns)
pets_df.rename(columns={
        'cats_owned' : 'cats',
        'parrots_owned' : 'parrots',    
        'dogs_owned' :'dogs',        
    }, inplace = True)
#pets_df

Index(['cats_owned', 'dogs_owned', 'parrots_owned'], dtype='object')


In [24]:
display(pets_df)

Unnamed: 0,cats,dogs,parrots
Jack,0,2,0
Mary,1,1,1
Bob,3,0,0
Kay,1,0,0
Tim,0,0,1
Jane,0,0,3
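
For comparison, this is what renaming looks like with the default `inplace = False`. It is a minimal sketch on a made-up two-column frame (the names `original` and `renamed` are illustrative): the method returns a new `DataFrame`, and the one it was called on keeps its old column names.

```python
import pandas as pd

original = pd.DataFrame({'cats_owned': [0, 1], 'dogs_owned': [2, 1]})

# Without inplace=True, rename() returns a renamed copy
renamed = original.rename(columns={'cats_owned': 'cats', 'dogs_owned': 'dogs'})

print(list(original.columns))   # unchanged
print(list(renamed.columns))    # new names
```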


A quicker way to do this is to assign a list of new column names to [`DataFrame.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html):

In [26]:
pets_df.columns = ['CATS', 'DOGS', 'PARROTS']
pets_df

Unnamed: 0,CATS,DOGS,PARROTS
Jack,0,2,0
Mary,1,1,1
Bob,3,0,0
Kay,1,0,0
Tim,0,0,1
Jane,0,0,3


This allows us to apply some change to all column names at once using a Python list comprehension. For example, we can make all the column names lowercase and remove the last letter from each of them (making it singular):

In [27]:
pets_df.columns = [colname[:-1].lower() for colname in pets_df.columns]
pets_df

Unnamed: 0,cat,dog,parrot
Jack,0,2,0
Mary,1,1,1
Bob,3,0,0
Kay,1,0,0
Tim,0,0,1
Jane,0,0,3


Finally, you can get the "raw" data from a dataframe using the [`DataFrame.to_numpy()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html) functionality.

You will usually do this after the data has been cleaned and analysed (which we will show next)!

In [69]:
pets_array = pets_df.to_numpy()
print(pets_array, type(pets_array))

[[0 2 0]
 [1 1 1]
 [3 0 0]
 [1 0 0]
 [0 0 1]
 [0 0 3]] <class 'numpy.ndarray'>


However, unlike with NumPy arrays, Pandas [`DataFrame`s](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) do not require all elements to be of the same type. In fact, we can mix different numerical types (integers, floats) with more complex data types.

In [29]:
packages_df = pd.DataFrame(
        {
            "height"  : [1.23, 0.20, 2.12, 0.71, 0.14, 0.83, 0.73],
            "width"   : [2.17, 0.85, 1.58, 0.51, 0.23, 1.02, 0.91],
            "length"  : [0.84, 0.31, 2.40, 0.63, 0.06, 0.51, 1.10],
            "shipping": ["standard", "urgent", "tracked", "tracked", "urgent", "standard", "standard"]
        }
    )
packages_df

Unnamed: 0,height,width,length,shipping
0,1.23,2.17,0.84,standard
1,0.2,0.85,0.31,urgent
2,2.12,1.58,2.4,tracked
3,0.71,0.51,0.63,tracked
4,0.14,0.23,0.06,urgent
5,0.83,1.02,0.51,standard
6,0.73,0.91,1.1,standard


In [30]:
packages_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   height    7 non-null      float64
 1   width     7 non-null      float64
 2   length    7 non-null      float64
 3   shipping  7 non-null      object 
dtypes: float64(3), object(1)
memory usage: 356.0+ bytes
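
Mixed column types matter when converting back to NumPy. As a rough sketch (the frame `demo_df` is made up for illustration): calling `to_numpy()` on a frame with both numbers and text yields an array of generic Python objects, whereas selecting only the numeric columns first, here with `DataFrame.select_dtypes()`, keeps a numeric array.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'height': [1.23, 0.20],
    'shipping': ['standard', 'urgent'],
})

# Numbers and text together: NumPy falls back to the generic object dtype
mixed = demo_df.to_numpy()

# Keep only numeric columns before converting
numeric_only = demo_df.select_dtypes(include='number').to_numpy()

print(mixed.dtype)
print(numeric_only.dtype, numeric_only.shape)
```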


Pandas also defines some of its own data types. An important one is `category`, which can be used to indicate that a feature is categorical, with values from a pre-determined set (e.g. "standard", "tracked" and "urgent").

We can change the type of a DataFrame or a Series with the method [`DataFrame.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html). For example, we can specify that our "shipping" feature is categorical:

In [31]:
packages_df.shipping = packages_df.shipping.astype('category')
packages_df.shipping

0    standard
1      urgent
2     tracked
3     tracked
4      urgent
5    standard
6    standard
Name: shipping, dtype: category
Categories (3, object): ['standard', 'tracked', 'urgent']
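
A categorical Series stores the set of allowed values once and represents each entry by an integer code. The sketch below (with an illustrative Series named `shipping`) inspects both through the `.cat` accessor. When the categories are inferred from strings, as here, they are sorted alphabetically.

```python
import pandas as pd

shipping = pd.Series(
    ['standard', 'urgent', 'tracked', 'tracked'],
    dtype='category',
)

# The distinct values the feature can take
print(list(shipping.cat.categories))   # ['standard', 'tracked', 'urgent']

# The integer code stored for each entry (position in the categories list)
print(shipping.cat.codes.tolist())     # [0, 2, 1, 1]
```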

[`DataFrame.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) can also be used to convert to other, more conventional types. In our case, converting to `int` loses precision, but it can be useful when integer values have been loaded as floats:

In [32]:
packages_df.height.astype('int')

0    1
1    0
2    2
3    0
4    0
5    0
6    0
Name: height, dtype: int64
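
Note that the conversion above cuts off the fractional part rather than rounding to the nearest integer. If rounding is what you want, one option is to call `round()` first, as in this small sketch (the values in `heights` are made up for illustration):

```python
import pandas as pd

heights = pd.Series([1.23, 0.83, 2.71, -0.9])

# astype('int') drops the fractional part (truncation towards zero)
truncated = heights.astype('int')

# Rounding first gives the nearest integer instead
rounded = heights.round().astype('int')

print(truncated.tolist())   # [1, 0, 2, 0]
print(rounded.tolist())     # [1, 1, 3, -1]
```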

## Analysing the dataset

Similarly to NumPy, Pandas has a set of functions used to summarise and manipulate the data, such as [`DataFrame.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html), [`DataFrame.max()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html) and [`DataFrame.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html).

However, with Pandas, we can apply these more intuitively on different features, without keeping track of feature names ourselves like in NumPy.

For categorical features (like `'shipping'`), it might be useful to count the occurrences of each of the different values of the feature. We can use the function [`DataFrame.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) for this:

In [33]:
packages_df.shipping.value_counts()

shipping
standard    3
tracked     2
urgent      2
Name: count, dtype: int64
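
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch, using the same shipping values in an illustrative standalone Series:

```python
import pandas as pd

shipping = pd.Series(
    ['standard', 'urgent', 'tracked', 'tracked', 'urgent', 'standard', 'standard']
)

# Fractions of the total instead of counts; they add up to 1
proportions = shipping.value_counts(normalize=True)
print(proportions)
```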

Let us, for now, select only the packages sent with `standard` `shipping` to analyse further:

In [34]:
standard_df = packages_df.loc[packages_df.shipping == "standard", 'height':'length']
standard_df

Unnamed: 0,height,width,length
0,1.23,2.17,0.84
5,0.83,1.02,0.51
6,0.73,0.91,1.1


We can get the _mean_ (similar for _sum_, _mode_...) of a feature using the function [`DataFrame.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html):

In [35]:
standard_df.height.mean()

0.93

Also similarly to NumPy, we can get the mean across all the features. However, with Pandas, the output is visually interpretable:

In [36]:
standard_df.mean()

height    0.930000
width     1.366667
length    0.816667
dtype: float64

However, imagine that we need to get the mean of every feature, for every type of shipping. It could be quite tedious to repeat the above steps for `tracked` and `urgent` (and what if the categorical feature had many more possible values?)

Fortunately, we can use the Pandas function [`DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to get the statistics _grouped by_ the values of the _shipping_ feature:

In [41]:
packages_df.groupby('shipping', observed=False).mean()

shipping,height,width,length
standard,0.93,1.366667,0.816667
tracked,1.415,1.045,1.515
urgent,0.17,0.54,0.185
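
Grouping is not limited to a single statistic. As a sketch on a small made-up frame (`demo_df` and its values are illustrative), the `agg()` method of the grouped object can compute several statistics per group at once, and `size()` reports how many samples each group contains.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'height':   [1.0, 3.0, 2.0, 4.0],
    'width':    [0.5, 1.5, 2.5, 0.5],
    'shipping': ['standard', 'standard', 'urgent', 'urgent'],
})

grouped = demo_df.groupby('shipping')

# One row per group; one (feature, statistic) column per combination
summary = grouped.agg(['mean', 'max'])
print(summary)

# Number of samples in each group
group_sizes = grouped.size()
print(group_sizes)
```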


It is worth mentioning that the mean does not exist for categorical values. Instead, we can talk about _mode_, the most common value, which we can obtain by [`DataFrame.mode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html):

In [100]:
packages_df.shipping.mode()

0    standard
Name: shipping, dtype: category
Categories (3, object): ['standard', 'tracked', 'urgent']

We can also obtain **more complex statistics** from Pandas. For example, a **correlation matrix** tells us about the relationship between features:
- a correlation of 1 means the features are perfectly positively related: one always increases when the other does (for the default Pearson correlation, one is exactly a positive linear function of the other)
- a correlation close to 1 means the features are strongly correlated: for bigger values of one feature, we will obtain bigger values of the other feature
- a correlation close to 0 means the features are mostly _uncorrelated_
- a correlation close to -1 means the features are _strongly negatively correlated_: for bigger values of one feature, we will obtain smaller values of the other feature

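We can obtain the correlation matrix between all features in a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using [`DataFrame.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html). By default it computes the Pearson correlation; below, we pass `method = 'spearman'` to use rank correlation instead, and `iloc[:, :-1]` leaves out the non-numeric `shipping` column: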

In [102]:
packages_df.iloc[:, :-1].corr(method = 'spearman')

Unnamed: 0,height,width,length
height,1.0,0.928571,0.821429
width,0.928571,1.0,0.678571
length,0.821429,0.678571,1.0
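
To see how the two methods differ, consider a feature and its cube. This is a small sketch on made-up data (`demo_df`, `x` and `x_cubed` are illustrative names): the relationship is perfectly monotonic but not linear, so the Spearman correlation is 1 while the Pearson correlation is somewhat lower.

```python
import pandas as pd

demo_df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})
demo_df['x_cubed'] = demo_df['x'] ** 3   # grows with x, but not linearly

# Pearson measures how well a straight line describes the relationship
pearson = demo_df.corr(method='pearson').loc['x', 'x_cubed']

# Spearman only looks at the ordering (ranks) of the values
spearman = demo_df.corr(method='spearman').loc['x', 'x_cubed']

print(round(pearson, 3), round(spearman, 3))
```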


## Missing values

One of the strongest features of Pandas is dealing with missing values in the dataset.

When creating a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) from a dictionary, we can use [`numpy.nan`](https://numpy.org/doc/stable/reference/constants.html#numpy.nan) to define missing values. However, when working with a dataset, these might simply be values that are not there.

Let's look at this example of a dataframe describing the locations of reported potholes in a city.

In [43]:
# example from https://www.practiceprobs.com/problemsets/python-pandas/dataframe/potholes/
potholes = pd.DataFrame({
    'length':[5.1, np.nan , 6.2, 4.3, 6.0, 5.1, 6.5, 4.3, np.nan, np.nan],
    'width':[2.8, 5.8, 6.5, 6.1, 5.8, np.nan, 6.3, 6.1, 5.4, 5.0],
    'depth':[2.6, np.nan, 4.2, 0.8, 2.6, np.nan, 3.9, 4.8, 4.0, np.nan],
    'location':pd.Series(['center', 'north edge', np.nan, 'center', 'north edge', 'center', 'west edge',
                          'west edge', np.nan, np.nan], dtype='string')
})
potholes

Unnamed: 0,length,width,depth,location
0,5.1,2.8,2.6,center
1,,5.8,,north edge
2,6.2,6.5,4.2,
3,4.3,6.1,0.8,center
4,6.0,5.8,2.6,north edge
5,5.1,,,center
6,6.5,6.3,3.9,west edge
7,4.3,6.1,4.8,west edge
8,,5.4,4.0,
9,,5.0,,


The summary of the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) will also tell us the number of "actual" (non-null) values in every column:

In [44]:
potholes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   length    7 non-null      float64
 1   width     9 non-null      float64
 2   depth     7 non-null      float64
 3   location  7 non-null      string 
dtypes: float64(3), string(1)
memory usage: 452.0 bytes


We can get the _positions_ of all the undefined values in our dataframe with [`DataFrame.isnull()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html), which returns a table of boolean values.

To count the undefined values, we can follow it with [`DataFrame.sum()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html): applying `sum()` once gives the number of undefined values per column, and applying it twice gives the total number for the whole dataframe:

In [45]:
potholes.isnull()#.sum().sum()

Unnamed: 0,length,width,depth,location
0,False,False,False,False
1,True,False,True,False
2,False,False,False,True
3,False,False,False,False
4,False,False,False,False
5,False,True,True,False
6,False,False,False,False
7,False,False,False,False
8,True,False,False,True
9,True,False,True,True
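
The counting step can be sketched on a smaller, made-up frame (`demo_df` is illustrative). Summing a boolean table treats `True` as 1 and `False` as 0, so the first `sum()` gives one count per column and the second adds those counts up.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'length': [5.1, np.nan, 6.2],
    'depth':  [2.6, np.nan, np.nan],
})

# Number of missing values in each column
per_column = demo_df.isnull().sum()
print(per_column)

# Total number of missing values in the whole table
total = demo_df.isnull().sum().sum()
print(total)
```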


### Removing rows with missing values

The simplest way to handle missing values, if we have plenty of data, is to ignore all the samples with any missing values.

This can be done with [`DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html):

In [47]:
display(potholes.dropna(inplace = False))


Unnamed: 0,length,width,depth,location
0,5.1,2.8,2.6,center
3,4.3,6.1,0.8,center
4,6.0,5.8,2.6,north edge
6,6.5,6.3,3.9,west edge
7,4.3,6.1,4.8,west edge


If data is scarce (and it sometimes is), we could instead choose to remove only the rows with fewer than `thresh` defined values, where `thresh` is a parameter of the [`DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method.

For example, we can remove all the rows which have fewer than half of their values defined (in our case, with 4 columns, these are the rows with 3 or more undefined values):

In [48]:
potholes.dropna(thresh = potholes.shape[1]/2, inplace = True)
display(potholes)

Unnamed: 0,length,width,depth,location
0,5.1,2.8,2.6,center
1,,5.8,,north edge
2,6.2,6.5,4.2,
3,4.3,6.1,0.8,center
4,6.0,5.8,2.6,north edge
5,5.1,,,center
6,6.5,6.3,3.9,west edge
7,4.3,6.1,4.8,west edge
8,,5.4,4.0,
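
Two variations of `dropna()` are sketched below on a small made-up frame (`demo_df` is illustrative): `thresh` keeps the rows with at least that many defined values, and `subset` only looks for missing values in the listed columns.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'length': [5.1, np.nan, np.nan],
    'width':  [2.8, 5.8, 5.0],
    'depth':  [2.6, 4.0, np.nan],
})

# Keep rows with at least 2 defined values (row 2 has only 1)
at_least_two = demo_df.dropna(thresh=2)

# Drop rows only if 'length' is missing, whatever the other columns hold
with_length = demo_df.dropna(subset=['length'])

print(list(at_least_two.index))
print(list(with_length.index))
```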


### Imputation

Imputation refers to when we instead replace the undefined (NaN) values with some fixed value. This might be a statistic calculated from all other available values.

For example, let us replace all missing values in the `length` column with the [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) value of this column.

First, let us get that one column ([`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)) into a new variable, and calculate the mean:

In [49]:
length = potholes.length
l_mean = length.mean()

print(length, l_mean)

0    5.1
1    NaN
2    6.2
3    4.3
4    6.0
5    5.1
6    6.5
7    4.3
8    NaN
Name: length, dtype: float64 5.357142857142857


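Then, we can use the [`Series.fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html) function to replace the missing values with the mean (there is an equivalent [`DataFrame.fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) function).

Notice how, in the output below, using `inplace = True` on the selected column changed the _original_ [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). In the pandas version used to run this notebook, selecting a column did not _copy_ the column; it created a _reference_ to it (which can still change the original object it is a part of). Do not rely on this: with Copy-on-Write (the default behaviour from pandas 3.0), the selected column behaves like a copy and the original `DataFrame` is left unchanged. Assigning the result back, as in `potholes['length'] = potholes['length'].fillna(l_mean)`, works in either case.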

In [50]:
length.fillna(l_mean, inplace = True)
potholes

Unnamed: 0,length,width,depth,location
0,5.1,2.8,2.6,center
1,5.357143,5.8,,north edge
2,6.2,6.5,4.2,
3,4.3,6.1,0.8,center
4,6.0,5.8,2.6,north edge
5,5.1,,,center
6,6.5,6.3,3.9,west edge
7,4.3,6.1,4.8,west edge
8,5.357143,5.4,4.0,
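
A version-independent way to write the same imputation is to assign the filled column back to the table. The following is a minimal sketch on made-up data (`demo_df` is an illustrative name), not a replacement for the cell above:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({'length': [5.0, np.nan, 7.0]})

# mean() skips missing values by default: (5.0 + 7.0) / 2
length_mean = demo_df['length'].mean()

# fillna() returns a new Series; assigning it back updates the table
demo_df['length'] = demo_df['length'].fillna(length_mean)

print(demo_df['length'].tolist())   # [5.0, 6.0, 7.0]
```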


We could also combine all this into a single line. In this case, we will fill all the undefined "location" values with the mode (the most common location appearing in the dataset), obtained with the [`mode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html) function.

Careful: since several values may be tied for the most common in a certain column, the [`mode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html) function returns a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) or a [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) rather than a single value. We therefore need to take the first entry of the output of [`mode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html), which we can do with `.loc[0]`. Assigning the result back to the column (rather than calling `fillna` with `inplace = True` on the selected column) works regardless of whether selecting a column gives a reference or a copy.

In [121]:
potholes['location'] = potholes['location'].fillna(potholes['location'].mode().loc[0])
potholes

Unnamed: 0,length,width,depth,location
0,5.1,2.8,2.6,center
1,5.357143,5.8,,north edge
2,6.2,6.5,4.2,center
3,4.3,6.1,0.8,center
4,6.0,5.8,2.6,north edge
5,5.1,,,center
6,6.5,6.3,3.9,west edge
7,4.3,6.1,4.8,west edge
8,5.357143,5.4,4.0,center


To do this for multiple features at once, we can select the columns with [`DataFrame.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html), passing both feature (column) names, and give the resulting per-column means to `fillna()`:

In [34]:
potholes.fillna(potholes.loc[:, ['width', 'depth']].mean(), inplace = True)
potholes


Unnamed: 0,length,width,depth,location
0,5.1,2.8,2.6,center
1,5.357143,5.8,3.271429,north edge
2,6.2,6.5,4.2,center
3,4.3,6.1,0.8,center
4,6.0,5.8,2.6,north edge
5,5.1,5.6,3.271429,center
6,6.5,6.3,3.9,west edge
7,4.3,6.1,4.8,west edge
8,5.357143,5.4,4.0,center
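
`fillna()` also accepts a dictionary that maps column names to fill values, which allows a different statistic per feature in one call. This is a sketch on a made-up frame (`demo_df` and `fill_values` are illustrative names), using the mean for a numeric column and the mode for a text column:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'width':    [2.0, np.nan, 4.0],
    'location': ['center', None, 'center'],
})

# One fill value per column: mean for numbers, most common value for text
fill_values = {
    'width': demo_df['width'].mean(),
    'location': demo_df['location'].mode().loc[0],
}

filled = demo_df.fillna(fill_values)
print(filled)
```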


## Reading from and writing into a file

Pandas actually knows how to work with many types of input files. However, we will here focus on `csv` files, often used to store datasets.

Here, and in the following workshop, we will work with the reduced dataset of [Algerian Forest Fires](https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++) (please download the file from BlackBoard). The dataset describes the weather conditions in the Bejaia region of Algeria, and the presence or absence of forest fires. We will look into the details in the workshop.

_Abid, Faroudja, and Nouma Izeboudjen. "Predicting forest fire in algeria using data mining techniques: Case study of the decision tree algorithm." International Conference on Advanced Intelligent Systems for Sustainable Development. Springer, Cham, 2020._

A file can be read by the [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

Since the dataset is likely to be big (as you can check with [`DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)), it might be more convenient to examine only the first few entries with the [`DataFrame.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) function:

In [122]:
fires_df = pd.read_csv('fires_bejaia.csv')
print(fires_df.shape)
fires_df.head(7)

(122, 15)


Unnamed: 0,ID,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
0,0,1,6.0,2012.0,29.0,57.0,18.0,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
1,1,2,6.0,2012.0,29.0,61.0,13.0,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
2,2,3,6.0,2012.0,26.0,82.0,22.0,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
3,3,4,6.0,2012.0,25.0,89.0,13.0,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
4,4,5,6.0,2012.0,27.0,77.0,16.0,0.0,,3.0,14.2,1.2,3.9,0.5,not fire
5,5,6,6.0,2012.0,31.0,67.0,14.0,0.0,82.6,5.8,22.2,3.1,7.0,,fire
6,6,7,6.0,2012.0,33.0,54.0,13.0,0.0,88.2,9.9,30.5,6.4,10.9,7.2,fire


We can already see from these 7 samples (but you can check by examining more) that the `ID` attribute is unique and can be used as an index.

We can tell this to pandas while reading from a file like this:

In [124]:
fires_df = pd.read_csv('fires_bejaia.csv', index_col = 'ID')
fires_df.head(3)

ID,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
0,1,6.0,2012.0,29.0,57.0,18.0,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
1,2,6.0,2012.0,29.0,61.0,13.0,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
2,3,6.0,2012.0,26.0,82.0,22.0,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire


Finally, to write a dataframe to a `csv` file, you can use [`DataFrame.to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html):

In [125]:
fires_df.to_csv('new_fires.csv')
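
Reading and writing can be tried without any file on disk, because both functions also accept file-like objects. The sketch below uses `io.StringIO` and a tiny made-up CSV text (the three columns and their values are illustrative, not the real dataset) to write a table and read it back.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "ID,Temperature,Classes\n0,29,not fire\n1,31,fire\n"

# read_csv accepts a file-like object as well as a file name
demo_df = pd.read_csv(io.StringIO(csv_text), index_col='ID')

# Write to an in-memory buffer; the index is saved as the first column
buffer = io.StringIO()
demo_df.to_csv(buffer)

# Read the written text back, naming the index column again
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), index_col='ID')
print(round_trip.equals(demo_df))
```

If `index_col` is left out when reading the file back, the saved index simply comes back as an ordinary column.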

In the workshop, we will further analyse the data about forest fires.