Pandas
------------------
In this notebook, I will try to give you an overview of the main features of pandas for data analysis. Generally, I like to think of pandas as being kind of like numpy on crack. (Although numpy is sometimes more efficient in the backend, pandas is usually easier to read.) It should be noted, however, that pandas is built on numpy: it simply turns the matrix that is a numpy ndarray into a labelled table (or DataFrame), which is easier to visualise.
Anyways, now that pandas has been introduced, you can import it (by convention, it is imported as pd).

P.S.: If you're not running this in a cloud-based environment, don't forget to run $pip\space install\space pandas$ in your terminal beforehand (if you're using pip3, first ask yourself what you're doing with your life, and then replace pip in the command by pip3). If you want to run the pip command directly inside jupyter, you can add a python cell, put an exclamation mark at the beginning of the command, and run the code cell (like this: $!pip\space install\space pandas$).

P.P.S.: Before you do anything else, figure out how to enable line wrapping in whatever notebook environment you're using (thank me later).

In [None]:
#Import pandas in this cell. You might also want to import numpy for some things, but you won't necessarily need it

#You will also want to import the display function from IPython.display (you will see why)
from IPython.display import display

Now that you have imported pandas, it is time to get familiar with the two main classes introduced by pandas: those being the pandas DataFrame, and the pandas Series. 

A DataFrame can simply be thought of as a table with column and row labels. DataFrames can either be imported from a csv (or Excel) file, or they can be created manually from within your code with the function $pd.DataFrame()$. 

Series are more subtle and you don't quite need to understand how they work for the moment, but you will definitely get a feel for how they work by the end of this notebook.

Furthermore, it should be noted that for the sake of compatibility with NumPy, pandas Series and DataFrames are both considered array-like objects, meaning most NumPy functions which accept an ndarray will also accept a Series or DataFrame.
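
To make the two classes a bit more concrete, here is a minimal sketch on a made-up table (the names toy, length and width are hypothetical and have nothing to do with the exercise below). It builds a DataFrame from a dictionary, pulls one column out as a Series, and passes both to a NumPy function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column name.
toy = pd.DataFrame(
    {"length": [1.0, 4.0, 9.0], "width": [16.0, 25.0, 36.0]},
    index=["a", "b", "c"],
)

# Selecting one column gives a Series that keeps the row labels.
lengths = toy["length"]

# NumPy functions accept both objects and hand pandas objects back.
root_series = np.sqrt(lengths)  # still a Series
root_frame = np.sqrt(toy)       # still a DataFrame

print(type(root_series).__name__, type(root_frame).__name__)
print(root_frame)
```

Notice that the row labels a, b and c survive the NumPy call, which is the main thing you gain over a bare ndarray.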

In [None]:
#For now, you will simply create a DataFrame from 6 lists (you will have to do something to them before calling pd.DataFrame), just to get a feel for how pandas works; we will use a legit data set later. The DataFrame should have 6 columns named "col_1" through "col_6", and the rows should be "row 1" to "row 6" (I will give you index and column lists)
list1=[4 ,13 ,23, 6, 7, 4]
list2=[2, 15, 19, 20, 6, 1]
list3=[12, 14, 1, 7, 3, 22]
list4=[7, 6, 16, 12, 0, 6]
list5=[25, 20, 3, 17, 21, 20]
list6=[6, 6, 5, 0, 0, 14]
list_of_lists=[list1, list2, list3, list4, list5, list6] #making a list of the lists, you never know, it might be useful
index_list=[f"row {i}" for i in range(1,7)] #too lazy to actually type the whole thing out so I'm using a list comprehension but trust me it contains the right strings
columns_list=[f"col_{i}" for i in range(1,7)]

"""Use the pd.DataFrame() function to create the dataframe, this function takes 3 (important) arguments:
    1. data: can either be a numpy ndarray, iterable, dictionary, or another DataFrame (or Series)
    2. index: can be an index object or an array-like, this is what defines the index (row) names of your output DataFrame
        Note that if data is already an indexed object (like a Series or DataFrame), it uses that index by default if no index is passed. Otherwise, the default is range(n), where n is the number of rows in your DataFrame
    3. columns: can be an index object or array-like, this defines the column names of your output
        Defaults kind of like index would, except if data is passed a dictionary, then dict.keys() is the default
    These are all positional arguments, so no need to specify the keywords if you don't want to"""

#Now insert code to create the DataFrame (and store it into a variable called data), there is more than one way to do this, you can try thinking of more than one or not, I don't really care (and even if I did, I'm not watching over your shoulder).



#Now that you have created your data frame, you can use the display function which we imported from IPython.display earlier to display the DataFrame nicely (if you want to see why I had you import this, call print(data), and see the difference):
display(data)

Now, let me go over a few useful things pandas lets you do (as well as how DataFrames are indexed) in this cell, and I will then have you practice those things in the next code cell. (Note that for methods and attributes of a DataFrame, I will use the structure df.attribute, where df is the name of my dataframe.)

    1.  Indexing and slicing a DataFrame: 
A DataFrame has 2 attributes which can be used for indexing: $df.columns$ and $df.index$. Both return a pandas Index object (the first holds the column labels, the second the row labels); these can be thought of as numpy arrays, and they are useful for looping over a DataFrame. Now, in our prior example, there are two ways to isolate a column from the DataFrame: you either call $data["col\_1"]$ or $data.col\_1$, both of which return a Series (which can be thought of as a single column or row from the DataFrame). Note that the first way is almost always preferred, because the second way only works if the column name is a valid Python identifier (no spaces, periods, or similar characters) and doesn't clash with an existing DataFrame attribute or method. Similarly, if you want to isolate a row, you call $data.loc["row\space 1"]$, where .loc tells pandas that you're searching for rows instead of columns (this isn't completely true, but it's a good way to think of it while learning). You can combine the two to read a single cell with an instruction of the form $data["col\_1"].loc["row\space 1"]$ (order doesn't matter).

Now, what if for some reason you want a pandas DataFrame back instead of a Series, or what if you want to isolate two or more rows or columns? In both cases, the answer is slicing: you pass a list of row or column names. For example, $data[["col\_1", "col\_5"]]$ returns a DataFrame with only col_1 and col_5. Finally, if for some reason you know a row's numerical position but not its name, you can use $.iloc[row, col]$, where iloc stands for integer location. The indexing inside $.iloc[]$ works the same as for numpy ndarrays, so I will assume you know how it works. 
(Note: if you want to be very efficient and you only need one value from the DataFrame, use $df.at[]$. In our example, $data.at["row\space 1", "col\_4"]$ returns 6. $df.iat[]$ works like at, but with integer positions.) (Note number 2: $df.loc[]$ can actually take a row and a column name, separated by a comma like in iloc. I wouldn't recommend that at first though, as it is easier to see what you're doing when rows and columns are selected in separate steps.)
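
Here is a short sketch of the accessors described above, side by side. It uses a small hypothetical 3 by 3 table (toy, with made-up labels r1 to r3 and a to c) rather than the data DataFrame, so it doesn't give away the practice cell further down.

```python
import pandas as pd

# Hypothetical toy table, unrelated to the exercise data.
toy = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r1", "r2", "r3"],
    columns=["a", "b", "c"],
)

column_a = toy["a"]                        # one column, as a Series
row_2 = toy.loc["r2"]                      # one row, as a Series
two_columns = toy[["c", "a"]]              # list of labels, gives a DataFrame
block = toy.loc[["r1", "r3"], ["a", "b"]]  # rows and columns by label
corner = toy.iloc[1:, :2]                  # rows and columns by position
cell_by_label = toy.at["r2", "c"]          # single value, by labels
cell_by_position = toy.iat[1, 2]           # same cell, by positions

print(two_columns)
print(cell_by_label, cell_by_position)
```

The last two lines point at the same cell, once through its labels and once through its integer positions.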
    
    2.  Getting a quick idea of what your DataFrame looks like:

1. $df.head(n=5)$ or $df.tail(n=5)$

Say you have a dataframe with 1000 rows, and you want to get an idea of what kind of data you have without having to display the full dataframe. One of the main ways to do that is by using the $df.head()$ method. This method takes one argument, $n$, which is $5$ by default, and returns the first $n$ rows of the DataFrame the method is called on. Similarly, $df.tail()$ returns the last $n$ rows of the DataFrame.

2. $df.dtypes$

This attribute returns a Series (don't worry too much about what that is for now), which is essentially a table of what type of data each column contains

3. $df.value\_counts(ascending=False, normalize=False)$

This method's name is pretty explicit as to what it does. The parameters allow you to set the counts to be sorted in ascending or descending order of occurrences, and $normalize=True$ returns the counts as a fraction of the total (a number between 0 and 1) instead of simply the number of occurrences

4. $df.describe()$

This method returns some statistical information about each numeric column of the dataframe: the number of non-missing values (count), mean, standard deviation, min and max values, and the quartiles. For text columns, this behaves differently, and returns the count, the number of unique values, the most frequent value and its frequency.
        
5. $df.columns$ and $df.index$

I have already quickly mentioned these in the indexing section, but I will add them here as well, as getting the list of column and row labels can help you get a quick idea of what the dataframe looks like
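
As a rough illustration of these preview tools, the sketch below runs them on a small made-up table (toy, with hypothetical score and city columns). It is only meant to show what each call hands back, not to stand in for the practice cell.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
toy = pd.DataFrame(
    {
        "score": [3, 5, 5, 8, 5, 3],
        "city": ["Lyon", "Paris", "Paris", "Nice", "Paris", "Lyon"],
    }
)

first_two = toy.head(2)          # first 2 rows
last_three = toy.tail(3)         # last 3 rows
column_types = toy.dtypes        # one dtype per column

# Counting values is done on a single column (a Series).
counts = toy["city"].value_counts()
shares = toy["city"].value_counts(normalize=True)  # fractions of the total

numeric_summary = toy["score"].describe()  # count, mean, std, min, quartiles, max
text_summary = toy["city"].describe()      # count, unique, top, freq

print(counts)
print(numeric_summary)
print(text_summary)
```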

Now, time for a bit of practice before some more info
------------------
Slicing and indexing 

In [None]:
"""For this section, we will still be using the dataframe you have previously created: 
First, let's practice slicing a DataFrame:"""
#Add code to isolate col_1 here (you can display it to check that your code works) (and store it in data1)
data1=

#Now add code to isolate row 3 here:
data2=

#Now add code to isolate the boxes that are part of the last 3 columns and the last 4 rows
data3=

#Challenge: try printing the elements along the diagonal (If you are inexperienced in coding, I do not expect you to be able to solve this in a way that would be feasible if you had 1000 rows of data instead of 6, though I will not include such a way to do this in my solutions)



Now, let's practice a bit of previewing data.

In [None]:
#Insert code to display the first 2 lines of data here (Slicing is possible but this is not the idea)


#What about if you want to display the last 3


#What are the datatypes of the column col_3 of data 


#What is the most frequent value of column 5, how many times does it appear (you will need to slice the DataFrame, check what happens if you don't, ask yourself why)


#What are some global stats about each column of data


#I have provided a new DataFrame with a text column underneath, try calling the same method onto that, see what happens
new_dataframe=pd.DataFrame(["Europe", "North America", "Asia", "Europe"], columns=["Continent"], index=["France", "Canada", "Japan", "Spain"])



Now that you have an idea of how to get a rough feel for what kind of data you're dealing with, let's go a bit more in depth into actual data analysis: mainly, how to import a dataset into python, how to add and remove elements from a DataFrame, and how to compute some significant statistical quantities about your DataFrame. Generally, this next section should give you a better feel for how to deal with a DataFrame for your own statistical projects. To teach you this, I have chosen to use the Iris dataset (which I have edited slightly to be able to teach you a few more things). I chose this dataset because it is often used as an example for data analysis in python: it was the first dataset I manipulated pandas DataFrames on, the dataset I learned most of my matplotlib.pyplot, seaborn and plotly.express on, and the dataset I wrote my first basic machine learning program on. Point is, this is a well built dataset, and I like it.

First of all, you will want to import the dataset into this notebook from the file you downloaded with it. To do that, use pd.read_csv(), a pandas function which takes a few arguments and returns a DataFrame. The important arguments you will want to know about are: 

1.  $filepath\_or\_buffer$: the first and only positional argument; this is the path to the csv file you want to read, passed as a string (in our case, "iris.csv"). Note that if the path includes a folder name and you write it with backslashes (Windows style), you have to type \\\\ instead of just \\, because "\\" is the escape character in Python strings (forward slashes work too)
2.  $delimiter$: a string containing the character that separates the elements of your csv file. By default it is ",", the usual csv separator in the Americas. If the file was produced on a device set up for a European country, more specifically one that uses a decimal comma instead of a decimal point, you will most likely have to specify delimiter=";"
3.  $sep$: fully equivalent to delimiter (programmers are lazy, and sep, short for separator, is faster to type than delimiter)
4.  $decimal$: the decimal separator used. By default it is ".", which is what we want. Once again, if your device (or you when you input the data) uses a decimal comma instead of a decimal point, you will have to specify $decimal=","$
5.  $index\_col$: if you want to use one of the columns from your csv file as an index, pass that column's name here. Otherwise, the function will simply generate integers from 0 to the number of rows minus 1 (by calling $range(n)$, where n is the number of rows) to serve as the DataFrame's index. In our case, we will want to use $index\_col="index"$.

There are more arguments you can pass, but these are the main ones
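
To see these arguments working together without needing a file on disk, here is a small sketch that reads csv text from a string. The contents of raw_text are made up for the illustration (they are not the iris file), and io.StringIO is only used so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical file contents in the "European" style:
# ";" between fields and "," as the decimal separator.
raw_text = "index;height;name\n0;1,5;oak\n1;2,25;pine\n"

# io.StringIO lets read_csv treat the string like an open file.
table = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",
    decimal=",",
    index_col="index",
)

print(table)
print(table.dtypes)
```

With a real file, you would pass its path as the first argument instead of the StringIO object.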

Call $pd.read\_csv()$ in the next code cell to import your dataset.

I won't go over the ways to get a rough idea of a dataset again as I have mentioned them earlier, but I would recommend practicing them (and I will include them in the solutions)

In [None]:
#assign the dataset to the iris variable
iris=

#If you want to try getting a feel for the data, feel free to here

Preprocessing the data
--------------------------------------------------

As you have just seen, before doing anything else, we will need to transpose our DataFrame to make it easier to work with. There are plenty of reasons why the orientation of a DataFrame matters, but the main one is that each column has a single data type while a row does not, meaning you can have text columns and number columns, but not text rows and number rows. 

There are two ways to get the transpose of a DataFrame:

    1. Pandas DataFrames have a $.T$ attribute which returns their transposed version, hence, reassigning iris, by calling $iris=iris.T$ would transpose the DataFrame

    2. DataFrames also have a $.transpose()$ method which returns the same object. By convention, this way is used when reassigning the DataFrame, while the $.T$ attribute is used to perform calculations.
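
The sketch below shows the transpose on a tiny made-up table stored "sideways" (the names sideways, upright and the labels are hypothetical). It also hints at why the data types need fixing afterwards: a column that mixed numbers and text has no numeric type to hand over.

```python
import pandas as pd

# Hypothetical table stored "sideways": one row per variable.
sideways = pd.DataFrame(
    [[5.1, 4.9], [3.5, 3.0], ["setosa", "setosa"]],
    index=["length", "width", "kind"],
    columns=["flower 0", "flower 1"],
)

upright = sideways.T  # same result as sideways.transpose()
print(upright.dtypes)  # the number columns are not typed as floats yet

# Convert one column explicitly once it only holds numbers.
upright["length"] = upright["length"].astype(float)
print(upright.dtypes)
```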

Once you have obtained the transpose of your dataframe, here are two more methods you could use in the next cell (though there is a solution which involves using neither)

1. $df.sort\_values(by,\space axis=0,\space ascending=True,\space inplace=False)$

This sorts the values in df along a given axis (0 or "index" if you want to keep the column order the same and sort the rows, 1 or "columns" if you want to keep the rows in place and sort the columns) based on the values in the $by$ column. For numerical columns, ascending order is used when $ascending=True$ and descending order is used when $ascending=False$. For string columns, alphabetical order is used when ascending is True and reverse alphabetical order otherwise.

The $inplace$ parameter is more subtle: if $inplace=False$, this method returns a sorted dataframe without making any changes to the dataframe it is called on. If $inplace=True$, then the method replaces the dataframe it is called on by the sorted dataframe and returns None.

2. $df.reset\_index(drop=False, inplace=False)$

This simply resets the index of your DataFrame to a list of integers starting at 0 and increasing by 1 every row. 

$inplace$ works exactly like in $df.sort\_values()$, and $drop$ indicates whether to drop the previous index ($drop=True$), or to add it to the new dataframe as a column ($drop=False$).

Note that it is never mandatory to reset your index, and it's usually more of a quality of life thing
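
A quick sketch of both methods on a hypothetical three-row table (toy, with made-up labels x, y and z), showing what happens to the old index in each case:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"name": ["c", "a", "b"], "value": [30, 10, 20]},
    index=["x", "y", "z"],
)

# inplace is False by default, so toy itself is left untouched.
by_value = toy.sort_values(by="value", ascending=False)

renumbered = by_value.reset_index(drop=True)  # old labels thrown away
kept = by_value.reset_index(drop=False)       # old labels become a column

print(by_value)
print(renumbered)
print(kept)
```

Since the original index had no name, the column created by drop=False is simply called index.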

One last thing that might be useful before you get back to writing code, and which I promised I would go over later: boolean indexing. The idea of boolean indexing is that you pass an array of booleans (also called a mask) as an index, and pandas keeps only the places where the mask is True. With a mask that has one boolean per row (a single column of the same length as your rows), the rows where the mask is False are removed. A mask with one boolean per column works too, but has to be passed as $df.loc[:, mask]$. With a mask of the same shape as your whole dataframe, the shape is kept and the values where the mask is False are replaced by NaN.

I will go over how to do this quickly, as it works exactly the same as in numpy, but say you have a dataframe and want to show only rows where the value in a particular column is greater than 6. This kind of situation is where you would use boolean indexing:

First you create a mask. For example, with our previous dataframe, $mask=data["col\_2"] > 6$ creates a pandas Series of booleans (for our purposes, it serves as an array-like). Comparison operators work on dataframes and series like they do on numpy arrays: if you don't know how that works, they compare each element to the other value that is passed (here that would be 6) and return a dataframe or series of the same shape holding the resulting booleans.

Then you pass the mask as an index to your dataframe. In our example, you use $data[mask]$. This returns a dataframe with only the rows where col_2 is greater than 6. You could've done this in one step with $data[data["col\_2"] > 6]$.
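
Here are the two steps again as a runnable sketch, on a hypothetical table (toy) so the exercise below stays yours to solve. The last line combines two conditions with &, which needs parentheses around each comparison.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"a": [1, 8, 3, 9], "b": [10, 20, 30, 40]},
    index=["r1", "r2", "r3", "r4"],
)

mask = toy["a"] > 6   # Series of booleans, one per row
big_rows = toy[mask]  # keeps only the rows where the mask is True

# Two conditions at once: each comparison goes in its own parentheses.
both = toy[(toy["a"] > 2) & (toy["b"] < 40)]

print(mask)
print(big_rows)
print(both)
```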


Now that you know all of this, you can practice by transposing the iris dataframe, and then splitting it into three dataframes (one for each flower type). Follow the instructions in the code cell for more details

In [None]:
#Only transpose if the data hasn't already been transposed
if "Class" not in iris.columns: #This is a sanity check in case you run the cell multiple times in a row, I'm including it this time, but in the future, you should try to think when it would be pertinent to include one.
    #Transpose here
    iris=

#Before splitting, we have to get the data types right. We do this by looping over the columns and using Series.astype() (a pandas method that works like numpy's ndarray.astype(); DataFrames have a df.astype() method too). However, we need a different type for the Class column (which is why we have to loop, and add a conditional statement in the loop). I have given you code for this as this issue only arises from me messing with the dataset to teach some other things
for i in iris.columns:
    if "Class" != i:
        iris[i]=iris[i].astype(float) #Set the type to float
display(iris.dtypes) #This is a sanity check to see if the type really was changed to float

#Split the DataFrame into three parts. One for Setosa, one for Versicolor, and one for Virginica, you will have to reassign iris

setosa=

versicolor=

virginica=

#I might not have given you enough space, but this is just me giving you some variable names, so they match the ones in my solutions, don't take this as an indication of the space your code should take

Some statistical quantities
-----------------

In this section, we will add (made up) error values as columns to each dataframe, and add rows with some statistical quantities about every (number-valued) column.

First, let's go over how to add rows and columns to a dataframe. For this, dataframes work almost like dictionaries, in the sense that assigning a value to a row or a column that doesn't exist simply creates it. To assign to a single cell, pass both labels to .loc: on our original example, $data.loc["row\space 1", "col\_1"]=0$ assigns the value 0 to the spot that is in col_1 and row 1 of the data dataframe. (Avoid the chained form $data["col\_1"].loc["row\space 1"]=0$ for assignment: depending on your pandas version, it can raise a warning or leave data unchanged.) Thus, $data.loc["row \space 7", "col\_7"]=5$ creates row 7 and col_7 and assigns 5 to the slot at their intersection; all spots that weren't assigned in the process will simply be filled by NaN values. 
You can also assign a full row (or column) at once, for example: $data.loc["row\space 1"] = data.loc["row\space 2"] * 2$ takes the values of row 2, multiplies them by 2 to create a new row of the right size, and assigns that to row 1. If instead of "row 1" we had indexed "row 8", a new row would have been created.
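
As a small sketch of these assignments (on a hypothetical two-row table called toy, not on data), the following adds a column, adds a row, and overwrites one cell:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r1", "r2"])

toy["total"] = toy["a"] + toy["b"]  # "total" doesn't exist yet: new column
toy.loc["r3"] = toy.loc["r1"] * 2   # "r3" doesn't exist yet: new row
toy.loc["r1", "a"] = 0.0            # existing cell: value is overwritten

print(toy)
```

Note that the new row is computed from r1 before r1 is modified on the last line, so the order of the statements matters.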

Now that you know how to add rows and columns, you can add four columns into your split dataframes, each of those columns containing the error on the values measured in one of the original columns. Since I have no idea how that data was measured, and since the real error value really isn't the point of this, you can use the following formula for computing the error on a value: $error=value\times0.05+0.1$ (I won't go over how operations work on dataframes as it is the same as numpy arrays)

In [None]:
#Add error columns into the DataFrame
#If you choose to do this by looping over the columns of your data frame, here are a few things you should consider
#First, the "Class" column shouldn't be included in your loop, as trying to perform operations like multiplication on a string column will raise an error. To avoid this, you can add a check that the current column's name isn't "Class" (actually, it's better to check if "Class" is in the name, in case the column names have spaces you aren't aware of). You could also check that the data type of the current column in the loop is a numeric type. To do this, you can use the pd.api.types.is_numeric_dtype() function, which takes a column (a Series) or a dtype as input, and returns True if its data type is numeric (int or float, for example), and False otherwise. However, this second function is out of the scope of what I'm trying to teach you in this notebook
#Second, you should consider what happens if you run this cell twice in a row. Say a code cell edits a dataframe called setosa: after the cell is done running, the dataframe stored in memory is now the edited version, meaning if you run the cell again, it will then edit the modified dataframe again. So if you add an error column for each column of your dataframe, and run the cell again, it will then add error columns for the error columns, and we don't want that. Therefore, you should add a check that the column you're adding an error column for isn't already an error column before adding error on it. (Checking that the word error isn't in the column name is a good way to go about this) (This is the last time I'm reminding you to add such a sanity check)
#If you choose to do this without looping, reconsider your life choices as that is tedious for no reason.




display(setosa.head())
display(versicolor.head())
display(virginica.head())
#displaying the head of each dataframe is a good idea as it helps you get a quick idea of whether you did what you wanted correctly or not

The next thing you will want to do is add a row for the mean of each column, one for the standard error on the mean, one for the variance and one for the standard deviation.

To do this, you will call 4 pandas methods, which happen to all have almost exactly the same structure: df.mean(), df.sem(), df.var(), and df.std(). The first one has a slightly different structure than the other three, but those three have exactly identical structures. Let's go over them:

    df.mean(axis=0, numeric_only=False): 

        This takes 2 inputs. The first is the axis along which you want to compute the mean: as usual, this can be 0 or "index" (meaning it returns the mean of each column of the df, one value per column), or 1 or "columns" (meaning it returns the mean of each row). Either way, the result is a Series. numeric_only tells the method whether to skip columns with non-numeric data types; skipped columns are simply absent from the result, so they end up as NaN when you assign the result as a new row. In our case, specifying numeric_only=True will be necessary because of the "Class" column.

        Note that df.max(), df.min(), df.mode(), and df.median() also all exist, and all have exactly the same structure as df.mean()
    
    df.sem(axis=0, ddof=1, numeric_only=False) (sem stands for standard error on the mean)

        This function takes 2 of the same inputs as df.mean(). The other parameter, ddof (which stands for delta degrees of freedom), might be familiar if you have used the scipy.stats.chisquare() function. The underlying variance is computed by dividing by $N-ddof$ instead of $N$ (where N is the number of rows). It is 1 by default, which gives the usual estimate when your rows are a sample from a larger population, but for our purposes, you will have to set it to 0.

        The df.var(), and df.std() methods take exactly the same inputs and return the variance and standard deviation
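
To see what ddof changes in practice, here is a minimal sketch on eight made-up numbers (the table toy and its columns x and label are hypothetical). The numbers are chosen so that the mean is 5 and the squared deviations from the mean add up to 32.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column.
toy = pd.DataFrame(
    {"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], "label": list("abcdefgh")}
)

means = toy.mean(numeric_only=True)  # the text column is skipped

var_divide_by_n = toy["x"].var(ddof=0)          # 32 / 8
var_divide_by_n_minus_1 = toy["x"].var(ddof=1)  # 32 / 7
std_divide_by_n = toy["x"].std(ddof=0)          # square root of 32 / 8
sem_divide_by_n = toy["x"].sem(ddof=0)          # that std over sqrt(8)

print(means)
print(var_divide_by_n, var_divide_by_n_minus_1)
print(std_divide_by_n, sem_divide_by_n)
```

The same ddof keyword works when the methods are called on a whole DataFrame, together with numeric_only=True.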

In [None]:
#Add rows for the mean, standard error, variance and standard deviation to each DataFrame


#Displaying the tail of each dataframe (specifically the last 4 rows is good practice to check we added what we wanted)
display(setosa.tail(4))
display(versicolor.tail(4))
display(virginica.tail(4))

However, we now run into a problem: our columns are very much out of order. We would usually want to have a column followed by its error, and so on. To fix this, we reindex the dataframe. There are $2$ ways to reindex a dataframe: we can either slice the dataframe and reassign the variable, or use the $df.reindex()$ method. I will show you both ways and let you choose your favorite. 

Either way, we need a list of the new order of columns we want (if we want to drop some columns, this could omit some of the original columns). Let's call that list ordered\_list. Once we have that list, we can choose our favorite method amongst the following.
1. Slicing:

Do you remember when I told you you could slice a dataframe by passing a list of columns as a column index to it? If you don't, here is a refresher: it looks something like this: $dataframe[["col\_1", "col\_2"]]$. However, what I haven't told you is that slicing actually returns the slice in the specified order, meaning $dataframe[["col\_1", "col\_2"]]$ and $dataframe[["col\_2", "col\_1"]]$ have different outputs. Thus, if you want to reorder your columns, you simply need to reassign your dataframe with an instruction of the form "dataframe=dataframe[ordered\_list]"

2. $df.reindex(labels=None,\space axis=None)$:

This function has two important parameters, $labels$ is the list of row or column labels in the order you want, and $axis$ indicates whether you want to reorder along rows or columns (yes, this can also reorder rows). As usual, use $axis=0$ or $"index"$ for rows and $axis=1$ or $"columns"$ for columns. Like with the first method you will have to reassign your dataframe (i.e.: use code of the form $df=df.reindex(*args)$)
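
Both approaches are sketched below on a hypothetical four-column table (toy and wanted are made-up names, and the columns are not the iris ones):

```python
import pandas as pd

# Hypothetical toy data: values first, then their errors.
toy = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4], "err a": [0.1, 0.2], "err b": [0.3, 0.4]}
)
wanted = ["a", "err a", "b", "err b"]

by_selection = toy[wanted]                               # method 1
by_reindex = toy.reindex(labels=wanted, axis="columns")  # method 2

print(by_selection)
print(by_selection.equals(by_reindex))
```

One difference worth knowing: if the list contains a label that doesn't exist, the first method raises a KeyError, while reindex creates that column and fills it with NaN.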

Now you can reorder the indexes of your dataframes with the method of your choice. I have provided the ordered list.

In [None]:
#Reorder your column labels
#Here is the ordered labels list:
ordered_list=["Petal_Length", "Error on Petal_Length", "Petal_Width", "Error on Petal_Width", "Sepal_Length", "Error on Sepal_Length", "Sepal_Width", "Error on Sepal_Width", "Class"]

#Change the index order here

Now, you might also be wondering what we are planning to do with the NaN values in the $Class$ column in the $mean$, $standard\space error\space on\space the\space mean$, $variance$, and $standard\space deviation$ rows. We could simply fill them by looping over the rows of each data frame and imposing that the cell in that row and in the $Class$ column is the name of the flower we want. We could even leave them as is since there is no ambiguity as to which is which. Or we could also fully drop the column from our dataframes (since we don't really need it anymore now that the dataframes are separated by flower type)

In any normal situation, the third option is the one that makes the least sense. However, it's the one we will use as it gives me an opportunity to teach you about how to drop columns from a dataframe using $df.drop()$. (Of course, it would also be possible by slicing like we did to reindex, but that becomes tedious if you start having many columns)

$df.drop()$ takes 3 arguments:

1. $labels$: either a string containing the single label to drop, or a list of labels to drop

2. $axis$: the axis along which you want to drop the label. This works as usual, use $0$ or $"index"$ for rows and $1$ or $"columns"$ for columns.

3. $inplace$: This is a boolean, False by default. Whether to replace the original dataframe with the one with dropped  labels ($inplace=True$), or to return that dataframe ($inplace=False$) 
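
Here is a short sketch of the three arguments on a hypothetical table (toy, with made-up columns a, b and kind), showing the difference between getting a new dataframe back and editing in place:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4], "kind": ["x", "y"]},
    index=["r1", "r2"],
)

without_kind = toy.drop("kind", axis="columns")  # new dataframe, toy unchanged
without_row = toy.drop(["r1"], axis="index")     # dropping a row instead

result = toy.drop("b", axis=1, inplace=True)     # edits toy, returns None

print(without_kind)
print(without_row)
print(toy)
print(result)
```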

Now, you can drop the $Class$ column from your dataframes

In [None]:
#Drop the Class column from your dataframes


Visualizing data
----------------

Although matplotlib.pyplot is usually the go-to for visualizing data, pandas does offer an alternative through the method df.plot() and its variants (which use matplotlib behind the scenes). This is what we will be going through in this next section. We will first go over how to get an idea of what the distribution looks like, by using kernel density estimators (kde), also known as density plots, as well as histograms and hexbin plots. Then we will go over ways to visualize the possible correlation of two variables by using scatter plots, and finally, we will quickly go over the other kinds of plots that can be made using df.plot().

First, we will note that df.plot() has many arguments, even in the pandas documentation, it is introduced as df.plot(*args, **kwargs), which is generally a sign of many arguments, so we will introduce them as we go through this section. (Note that we won't focus too much on the keyword arguments, as they are mostly there for matplotlib compatibility)

The first 3 arguments we will go through are by far the most important, so pay attention, here they come:

1.  x: this is the column name (or list of column names) of your dataframe, that you want to see graphed on the x axis.

2.  y: this is the column name (or list of column names of the same length as x), that you want to see graphed on the y axis.

3. kind: this is a string containing the kind of plot you want to make; it can be something like "kde" or "density" or "hist", and a few others. This is also when we introduce another idea: df.plot has two forms. It can either be written as df.plot(kind=kind, *args) or as df.plot.kind(*args). In other words, df.plot(x, y, kind="scatter") and df.plot.scatter(x, y) are equivalent. The main practical difference is that each df.plot.kind() method lists the arguments specific to that kind of plot in its own signature and documentation, which makes them easier to find.

The first kind of plot we will go over is the kernel density estimator (kde from now on), which can also be referred to as a density plot. You probably won't have heard of this kind of plot before; however, it is a very powerful tool. The goal of a density plot is to estimate a variable's probability density function from the data, which pandas does by placing a small gaussian bump on each data point and summing them. This can instantly give you an idea of how spread out a dataset is. I will also use it as an excuse to introduce a few more parameters.

A kde plot can be obtained with kind="kde", or kind="density" (or df.plot.kde()/df.plot.density()). Note that pandas relies on SciPy to compute it, so scipy needs to be installed.

When making a plot that only requires one of x or y to be specified (such as a kde plot), it is important that you specify the column names you want under the y parameter.

Finally, it's time to introduce a few parameters:

First, let's talk about the figsize parameter. By default, it is set depending on your matplotlib parameters, and that's usually a decent enough figsize. However, sometimes, specifying figsize can be useful (especially for subplots). figsize accepts a tuple (width, height) of the width and height of the figure (both are in inches).

Then, the title parameter takes a string which will be the title of your figure.

Now, say you want to make a kde plot for your setosa dataframe, by using y=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]. By default, all of these curves will appear on the same plot. If you wanted each of them on its own plot, you would have to use four different calls of setosa.plot(), which is tedious. There is an alternative: if you know about matplotlib, you might have heard of subplots, which are essentially a grid of plots. Well, df.plot() accepts a boolean argument called subplots. By default it is set to False, but if you set subplots=True, each column will appear on its own subplot.

If you specify subplots=True, here are a few more parameters you might want to know about:

1. layout: accepts a tuple (a,b), where a is the number of rows in your grid of subplots and b is the number of columns. By default, layout=(len(y),1), which is usually not optimal: if you are plotting 6 elements, a $2\times3$ grid usually looks better than a $6\times1$ one.

2. sharex and sharey are both booleans. Setting sharex to True forces all the subplots to have the same x axis (you would usually do this for plotting residuals, for example). The same idea goes for sharey. sharey is False by default, while sharex defaults to True when subplots=True and you haven't passed your own ax, so it is safest to set both explicitly.
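To give you a feel for how subplots, layout, sharex, sharey, figsize and title fit together, here is a minimal sketch. The dataframe toy and its column names are made up for illustration, and I use kind="hist" rather than kind="kde" so as not to give away the exercise (and so the sketch doesn't need scipy).

```python
import numpy as np
import pandas as pd

# Made-up data: four columns of random numbers standing in for four measurement columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

# One histogram per column, arranged in a 2x2 grid instead of the default 4x1
axes = toy.plot(
    kind="hist",
    y=["a", "b", "c", "d"],
    subplots=True,
    layout=(2, 2),
    sharex=True,
    sharey=True,
    figsize=(8, 6),  # size of the whole figure, in inches
    title="Toy histograms",
)

# With subplots=True and a layout, you get back an array of axes shaped like the grid
print(axes.shape)
```

Note that with subplots=True the return value is an array of axes objects rather than a single one, which matters if you want to reuse them later.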

Now that you know the basic idea for a kde plot, you can create a density plot (use subplots, use your judgement to figure out the layout) of the 4 data columns (for one of the flowers of your choice). Since there is no reason for the x and y axes to be very different from one subplot to the next, you might want to make them shared to be able to compare more easily. Finally, you should also try setting figsize (in the case of a subplots object, figsize is the size of the whole figure, not of each subplot).

In [None]:
#Kernel density estimators plot on setosa


Now, you might be wondering what would happen if you wanted to have data from multiple different dataframes on the same plot. This is where the next parameter comes in: the ax parameter allows you to specify a preexisting axes object you want the new plot to appear on.

The way you would use this is by assigning the output of your first call of df.plot() to a variable. That variable will then be your axes object. Here is an example:

axes=df1.plot(*args)

df2.plot(ax=axes, *args)

I only included the basic structure of the ax argument without any of the other arguments here to give you an idea, but it should be clearer in the next exercise. 
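If you want to see the ax pattern with actual arguments filled in, here is a minimal sketch. The two dataframes and their columns are made up for illustration, and I use a line plot just to keep things simple.

```python
import pandas as pd

# Two made-up dataframes that share the same column names
df1 = pd.DataFrame({"t": [0, 1, 2, 3], "value": [1.0, 2.0, 1.5, 3.0]})
df2 = pd.DataFrame({"t": [0, 1, 2, 3], "value": [2.5, 1.0, 2.0, 0.5]})

# The first call creates the axes object and returns it
axes = df1.plot(x="t", y="value", kind="line", figsize=(6, 4))

# The second call draws onto that same axes object instead of making a new figure
same_axes = df2.plot(x="t", y="value", kind="line", ax=axes)

print(same_axes is axes)
```

The second call hands back the very same axes object you passed in, so you can keep chaining further calls onto it.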

Finally, you might be wondering what happens to the legend labels, which so far were just the column names. Well, as usual, there is a parameter for that: the label parameter (a string) will be used in the legend for the corresponding (x,y) pair. This means that if x and y are lists, you should also specify a list of labels. While we are at it, I might as well tell you about the legend argument: by default, legend=True and the legend is shown; if legend=False, the legend is hidden.

You should also know that the default value for title is None, meaning that if you only specify it in your original call of df.plot() (i.e.: in our example, the line where you go axes=df1.plot(*args)), that is enough and you don't need to specify it again in later calls of df.plot() that are applied on the same axes object. The same goes for figsize. However, this is not true for legend. If you specify legend=False on your first call of df.plot(), the legend will be hidden for all the plots associated with that call, but if the next call has legend=True, then the legend will be shown for those. In our example, if the first call specifies legend=False, then the df1 part of the plot won't appear in the legend, but if the df2.plot() call specifies legend=True, then the legend is shown for the df2 part of the plot. In short, legend works call by call.

Knowing that, try making a density plot of the Petal_Length column for all three flowers. 

In [None]:
#Now try doing Petal Length for each flower kind but on the same plot
#I have given you a variable to store your axes object in
stacked_kde_plots=


Now, let's talk about making a histogram, which works almost exactly like making a kde plot, except you use kind="hist", and you can specify an additional parameter, bins, which takes an integer and, without much surprise, is the number of bins your histogram should have (by default it is 10).

Let's also introduce 4 more parameters (really 2 sets of 2):

1.  xlim and ylim each take a tuple of floats of the form (lower_bound, upper_bound), and specify the bounds of the axis.

2.  xticks and yticks each take a list (or array-like) of tick positions along the x or y axis, meaning that if for some reason you needed irregular tick positions, or you wanted to fix the interval, you would specify xticks or yticks (usually you can use the range function coupled with some clever operations to get the ticks you want). For example, if you want ticks on your x axis at -9,-6,-3,0,3,6,9, you could do xticks=np.array(range(-3,4))*3 (you have to use np.array() to convert the range object into an array, as multiplying a range by a number isn't defined, and multiplying a list by a number repeats the list instead of scaling its elements). (These work like title: when making multiple graphs on the same axes object, you only need to specify them on the first call of df.plot().)
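Here is a small sketch of the tick arithmetic described above, followed by one way to feed the result to a histogram along with bins and xlim. The dataframe toy is made up, and np.arange is an alternative I am adding here that does the same job in one step.

```python
import numpy as np
import pandas as pd

# range(-3, 4) gives -3..3, and multiplying the array by 3 spaces the ticks 3 apart
ticks = np.array(range(-3, 4)) * 3
print(ticks)

# np.arange(start, stop, step) builds the same array directly (stop is excluded)
ticks_alt = np.arange(-9, 10, 3)

# Made-up data to try the ticks on
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": rng.normal(scale=3, size=200)})
ax = toy.plot(kind="hist", y="x", bins=15, xlim=(-9, 9), xticks=ticks)
```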

Now that you know about these parameters, try making a kde, and histogram (use 15 bins) of the Sepal_Length column of the versicolor dataframe on the same plot (they should have roughly the same shape, since kde tries to guess the probability density of the distribution and histogram is the distribution). Does it look good? Why or why not? If it doesn't, simply make two different plots (on distinct objects), one with the kde and one with the histogram (be sure to have the same x axis on both to make comparing them easier, think of why you don't want the y axis to also be the same). 

In [None]:
#Now what if we try overlaying the kernel density estimator with a histogram of the data to check that it is reasonable, let's do this on the Sepal Length data for versicolor
histogram_overlayed_on_kde=


#If the plot looks bad, make two side by side plots instead


Another kind of plot we can make to get an idea of the distribution is a hexbin (or hexagonal bin) plot. The best way to think of a hexbin plot is as a two dimensional histogram, where each bin is a hexagon. When compared to a histogram, this of course has advantages and drawbacks: the main advantage is that you get to see the spread of the distribution along 2 parameters instead of 1, but the tradeoff is that you lose bin height as a clear indicator of which bin has more counts than another, and instead have to use a color scale.

As you might have guessed, a hexbin plot is obtained with kind="hexbin" or df.plot.hexbin(). There are 2 new parameters associated with this kind of plot: $xlabel$ and $ylabel$, respectively, the label for the x axis and the label for the y axis. (These parameters aren't supported for kde plots or histograms, which is why I didn't introduce them earlier). Once again, these work like title (you only need to specify them once per axes object)

Before I have you try a hexbin plot, know that this is usually used when you have a lot of data, so there isn't really much information to gain here since we only have 50 measurements.
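To show what a hexbin looks like when there actually is a lot of data, here is a minimal sketch on made-up random numbers. The dataframe big and its columns are hypothetical, and gridsize (the number of hexagons along the x direction) is an extra matplotlib argument I haven't covered above.

```python
import numpy as np
import pandas as pd

# Made-up data: 5000 points drawn from two independent normal distributions
rng = np.random.default_rng(1)
big = pd.DataFrame({"u": rng.normal(size=5000), "v": rng.normal(size=5000)})

# Each hexagon is coloured by how many points fall inside it
ax = big.plot(kind="hexbin", x="u", y="v", gridsize=25, title="Toy hexbin")
```

With this many points the color scale becomes informative, which is not really the case with only 50 measurements.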

That said, try making a hexbin plot of petal width against sepal width on the setosa dataframe.

In [None]:
#Now make a hexbin plot of sepal width and petal width for the setosa dataframe


Now we're done with getting an idea of what the distribution looks like, we can try getting an idea of how our parameters could be related, to do that, we make some scatter plots. 

We obtain scatter plots with $kind=$"$scatter$". This introduces three different arguments: $color$, $marker$, and $s$.

Let's begin with the end (I'm logical like that): $s$ is the size of the marker used. The default depends on your matplotlib settings, but sometimes it might be hard to see, because either the points are on top of one another, in which case, a small size value might be better, or because the points are too small. There is more information about this in a docstring in one of the code cells in my solution for the curious.

Then, $marker$ takes a character (or 1 digit int) as an input. More specifically, it has to be a character from a specific list, which includes "o", "s", "^", "v", for circle, square, upwards pointing and downwards pointing triangle respectively. If you look at my solutions for the scatter plot exercises to come, I have tried to vary the markers I use as much as possible to show you some more of the possibilities.

Finally, $color$ can be a string naming one of the default matplotlib colors, which include red, orange, blue, green, black, purple, pink, yellow, gray, and possibly some others. $color$ can also accept a tuple of RGB values, although these take floating values from 0 to 1 instead of 0 to 255, so you have to convert your RGB values by dividing them by 255.
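As a quick illustration of $color$, $marker$ and $s$ together, here is a minimal sketch that overlays two made-up groups of points on the same axes object. The dataframes and column names are hypothetical, and I stick to color names here.

```python
import pandas as pd

# Two made-up groups of points
group_a = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 1.5, 3.5, 3.0]})
group_b = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1.0, 2.5, 2.0, 4.0]})

# First group: purple upward triangles, fairly large
ax = group_a.plot(kind="scatter", x="x", y="y", color="purple", marker="^", s=60, label="group a")

# Second group on the same axes: green squares, a bit smaller
group_b.plot(kind="scatter", x="x", y="y", color="green", marker="s", s=30, label="group b", ax=ax)
```

Using a different marker per group keeps the groups distinguishable even if the plot is printed in black and white.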

Now, we can start getting an idea of the possible relations between our variables.
For example, we can test for a correlation between Petal length and width (i.e. check if the seemingly random variations of petal length correlate positively with the seemingly random variations of petal width, or if they are uncorrelated, note that this tells us absolutely nothing about a causal relation)

In [None]:
#Overlay the three scatter plots on one another
scatter_plot=

You might have noticed that an extra label, index, appeared in the legend of this plot for seemingly no reason. The exact reason why this happens is complicated, but it can be summed up by the fact that the columns axis of our dataframe is named "index" (I had to do that when I messed with the dataset to be able to teach you a few more things). To fix this, you use a method that I originally hadn't planned to talk about to rename your columns axis to an empty string. Since this is out of the scope of what I'm trying to teach, I will simply give you the code; calling this on all three of your dataframes should fix the problem:
setosa.rename_axis("", axis=1, inplace=True) (replacing setosa by the appropriate variable name every time).

In [None]:
#Rename your columns axis in this cell.


Now that that has been said, do you notice anything else that seems off about this plot?

5 more lines to scroll until you get to the answer

4 more

3 more

2 more

1 more

Answer is on the next line

If you answered that there seem to be outliers, you would be right. As a matter of fact, we could have noticed these outliers when we did the kde plots and histograms: remember that small spike in counts before the data began? If you squint at the hexbin plot, you might also notice them around the bottom left corner.

Now that you noticed the outliers, can you guess where they come from?

Hint: if we had done absolutely nothing with the dataset, they wouldn't exist

Hint no 2: there are exactly three outliers for each flower (but four points that don't belong, one of them just happens not to be an outlier)

Hint no 3: if you really don't know, display the tail of your dataframe. 

Now that you (hopefully) know where they come from (and ideally understand why they shouldn't be there), how do you get rid of them (ideally without dropping them from your df)? 

Does this give you any new insights as to why the kernel density estimation plots weren't quite gaussian, but looked like a sum of two gaussians instead?

Now that you have thought of a fix for this outlier problem, try making a sepal width vs sepal length plot, but without the outliers.

You can also try using the same fix on the kde, histogram and hexbin plots you previously made. (The hexbin will still look pretty bad). You can also go back to the previous scatter plot and fix it as well.

In [None]:
#Now do sepal width vs sepal length, but without the outliers
#Use three different markers if you weren't already doing that.
scatter_plot=

Finally, let's introduce the last 4 parameters we will be discussing here: $xerr$ and $yerr$, $grid$, and $alpha$.

$xerr$ and $yerr$ are column names for the horizontal and vertical error bars. We won't really go over error bar formatting here: if you want to format your error bars, you get to the point where you're better off using matplotlib.

$grid$ is a boolean. By default, $grid=None$, but here $None$ acts like $False$, meaning if you specify $grid=True$ in one call of $df.plot()$, but leave it unspecified in the next, the second call will overwrite the first, and $grid$ will be set back to $False$. In other words, you only need to specify $grid$ in your last call of $df.plot()$.

Finally, $alpha$ is a float that goes from $0$ to $1$; by default it's $1$. $alpha$ sets the opacity of the marker ($alpha=0$ is fully transparent, meaning the marker is invisible, and $alpha=1$ is fully opaque, meaning a marker will fully hide any marker behind it).
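To make these four parameters concrete, here is a minimal sketch on a made-up dataframe. The column names and error values are invented for illustration and have nothing to do with the iris data.

```python
import pandas as pd

# Made-up measurements with one error column per quantity
meas = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [0.9, 2.1, 2.9, 4.2],
    "length_err": [0.1, 0.1, 0.2, 0.1],
    "width_err": [0.2, 0.1, 0.3, 0.2],
})

# xerr and yerr are given as column names; alpha makes the markers semi-transparent
ax = meas.plot(
    kind="scatter",
    x="length",
    y="width",
    xerr="length_err",
    yerr="width_err",
    alpha=0.6,
    grid=True,
)
```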

Now that you know this, you can make a plot of petal length against sepal length. With error bars on both quantities. Disclaimer, adding error bars on this dataset is a bad idea, I am only doing this because you need to learn how to make error bars appear, but in practice, use your judgement to decide whether you want error bars or not.

Since this plot is going to look horrible anyways, you might as well add gridlines onto it.

In [None]:
#Finally, we do petal length vs sepal length but we add error bars (on this dataset the error bars will look messy, so when applying this on a "real" dataset, consider how the error bars look and decide whether or not it is worth showing them)
#Since this is going to be a monster anyways, you might as well try adding gridlines to the plot. Note that something funny about gridlines is that even though, like plot title and axis labels, the default for the grid parameter is grid=None, here None acts like False, and will overwrite any prior grid=True, while title=None simply keeps any previously specified title.
#If you haven't yet, you might also want to try messing with the size argument
scatter_plot_with_error=

With all these plots available to you, here is one more coding challenge: write a loop to generate all 16 possible plots with those 4 data columns (don't use a subplots object). For the sake of this exercise, we consider x vs y and y vs x as different plots. If you are plotting a column against itself, make a histogram (use 5 bins this time) instead of a scatter plot (for the histograms, since they will have overlap, you will have to specify alpha).

In [None]:
#One final challenge, write a loop to generate all 16 possible plots with those four data columns (don't use a subplots object), if you are plotting a column against itself, make a histogram instead of the scatter plot. (If you go through my seaborn overview after this, you will see that using seaborn, this exercise becomes trivial)
#Don't forget to remove the mean, standard error, variance and standard deviation we computed earlier from your data before plotting this
#I have given you a list of the columns you want to loop over to work with
columns_to_loop_over=["Sepal_Length", "Petal_Length", "Sepal_Width", "Petal_Width"]


Note that I haven't shown you any of these, as they are less frequent (and would make no sense here), but "bar", "barh" (h is for horizontal), "line", and "pie" are all acceptable arguments for kind, which means you can also make them using df.plot.bar() (or barh or line or pie). They work in very similar ways to the plots I have already shown you, so we will skip these for now. The only thing you really need to know is that line is essentially scatter but with the markers connected.

This concludes our overview of plotting in pandas.

Exporting data
-------------

In this part of our pandas overview, we will go over $2$ ways to export a dataframe: the first one to a csv file, and the second one to a LaTeX table. But before that, we will talk about how to insert a column into a specific position. Note that, amongst others, it is also possible to export a dataframe to an Excel spreadsheet, but it's more complicated, and you most likely won't need to do this.

Before you export your data, for presentation purposes, it might be interesting to include the flower type in your dataframe's columns (you remember? That column we dropped earlier). Ideally, we would make that the first column in our dataframe. Of course, to do that, it is possible to simply assign a new column at the end of the dataframe, and reindex. But there is a simpler way: using the $df.insert()$ method. This method takes $4$ arguments:

1. $loc$: The numerical index you want the inserted column to be located at in your dataframe. Remember, indexing starts at $0$.

2. $column$: A string or numerical value, this will be the inserted column's name

3. $value$: This can be a series or array-like, it's the values that the column will take

4. $allow\_duplicates$: This is a boolean, False by default. With the default, trying to insert a column whose name already exists in the dataframe raises a ValueError instead of inserting it. This is a precaution in case a cell is run multiple times in a row, to avoid inserting the same column each time. It is a good precaution to have, but you should still check yourself beforehand that the column doesn't already exist to avoid raising errors.

Note: This method only works to insert columns.
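Here is a minimal sketch of $df.insert()$ on a made-up dataframe (the column names are hypothetical), including what happens when the same cell is run a second time with the default $allow\_duplicates$.

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 2, 3], "c": [4.0, 5.0, 6.0]})

# loc=0 puts the new column first; insert modifies df in place and returns None
df.insert(0, "a", ["x", "y", "z"])
print(list(df.columns))

# Inserting the same column name again raises a ValueError by default
try:
    df.insert(0, "a", ["x", "y", "z"])
except ValueError as error:
    print("Second insert refused:", error)
```

Because insert works in place, you should not write df = df.insert(...), as that would replace your dataframe with None.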

In the next code cell, insert a $Flower\space Type$ column in the first position of your dataframes. I have provided the code of a loop that generates all three lists containing the flower name the right number of times.

In [None]:
#Add a Flower Type column to your dataframes
#I have provided a loop to create the list of flower names the right number of times
setosa_list=[]
versicolor_list=[]
virginica_list=[]
for i in setosa.index: #since all of my dataframes have as many rows, I only need to loop over one of them, if they didn't I would need three distinct loops
    setosa_list.append("Setosa")
    versicolor_list.append("Versicolor")
    virginica_list.append("Virginica")

#Now, insert your code here


First let's go over the $df.to\_csv()$ method, which creates a csv file and stores your dataframe into it. It has a few parameters you might want to know of:
1. $path\_or\_buf$: this is the only positional argument, it is a string containing the path of the file you want to write your dataframe to in csv format, it should end with $.csv$. If the filepath specified doesn't exist, this will create a new file, if it does already exist, the old file will be overwritten. Be careful not to overwrite files you need!
2. $sep$: this is the separator to use in your csv file, by default it is a comma (i.e.: $sep=$"$,$"). Though you should change that to a semicolon if for some reason you want to use a comma as a decimal separator instead. 
3. $columns$: this is a list of the columns of your dataframe you want to include in the csv file that will be created
4. $decimal$: this is the decimal separator used, by default it is a period (i.e.: $decimal=$"$.$")
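To see what $sep$, $columns$ and $decimal$ actually do to the output, here is a minimal sketch on a made-up dataframe. I write to an in-memory buffer (io.StringIO) instead of a file path, which to_csv also accepts, so that nothing on your disk gets overwritten.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1.5, 2.25], "b": [3.0, 4.0], "c": ["x", "y"]})

# Semicolon as the field separator, comma as the decimal separator, and only columns a and b
buffer = io.StringIO()
df.to_csv(buffer, sep=";", decimal=",", columns=["a", "b"])

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the row labels are written as the first column, which is why the header line starts with an empty field.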

Try saving all three of your dataframes to distinct csv files in the next code cell. Only save the original datacolumns, not the error columns you have added.

In [None]:
#Save your data to csv here


Finally, we keep one of the most argument-heavy methods for the end. If you don't plan on using latex, skip over this next section, but if you do, this will save you lots of time copying your data into latex. I'm of course talking about the $df.to\_latex()$ method. Note that this method has a dependency on the jinja2 module, so if you don't have it, install it with pip. This can do one of two things: return a string of latex source code that generates the table when pasted into latex, or write that string into a file (usually, a .txt). This takes many arguments, let's go over them:

1. $buf$: this is the path of the file to write to if you want to write to a file. If you don't want to write to a file and only want to return the string, leave it to the default $buf=None$.

2. $columns$: this argument works like in $df.to\_csv()$, it's the list of columns you want in your latex table.

3. $header$: This is a boolean (True by default), it indicates whether to write out the column names in the latex table or not

4. $index$: This is also a boolean (also True by default), works like $header$ but for rows.

5. $na\_rep$: This is a string ($'NaN'$ by default), missing values in your dataframe will be replaced by this string in your latex table

6. $float\_format$: This is a formatting function for rounding of floats. Here are $2$ ways you can round to n decimals: $float\_format="\%.nf"$  or $float\_format="\{:0.nf\}".format$ (replacing n by the number of decimals you want)

7. $bold\_rows$: This is a boolean (False by default). Specifies whether you want the row names to be in bold in the output.

8. $column\_format$: This is the column format string as you would specify it in Latex, by default it uses "r" for number columns and "l" for all other columns

9. $longtable$: Another boolean. Whether to use longtable instead of tabular. However you will need to add \\usepackage{longtable} somewhere in your latex preamble.

10. $decimal$: Works exactly like in $df.to\_csv()$

11. $caption$: string (full caption only) or tuple: ($full\_caption$, $short\_caption$). Results in: \\caption[short\_caption]{full\_caption}

12. $label$: The latex label to be placed in \\label{} (to be used in conjunction with \\ref{} in your main .tex file)

13. $position$: string of the latex positional arguments to be placed after \\begin{} in the output

With all this, make a .txt file containing the latex source for a longtable of each of your dataframes. You should use the same columns as with the df.to_csv() exercise. Row and column names should be shown. "NaN" values (which don't exist in this dataset) should be replaced with the string "Latex" just so that you don't forget about this argument. Round floats to 2 decimals, and make the row names bold (this makes no practical sense, but why not?). Use the longtable package. You should have a short caption and a full caption (decide what you want them to be), and don't forget to give your table a label in latex.

In [None]:
#make latex tables for your dataframes


Conclusion
-----------

And that covers it: that was an overview of some of the basics of pandas for data analysis and visualization. There are of course many more advanced features in pandas, but I think this is a good starting point. If you want to learn about some more plotting tools available in Python, I have made (or am in the process of making, depending on when you read this) similar notebooks containing overviews of matplotlib (mostly pyplot), seaborn, and plotly.express.

I would normally add a paragraph with some key takeaways here, but at the moment I'm writing this, I'm too lazy to do it, although this might be edited later.

Finally, here are the sources I used to make this:

1. Pandas documentation website: https://pandas.pydata.org/docs/index.html

2. Iris dataset: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html