Put your name here:
* Saumya Shah


Put the names of the people you worked with here:
* Parth Maheshwari
* KSD Teja
* Tanishq Tanmay
* Kuldeep Singh

Our group divided the datasets according to:

**Penguin**


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

# sns.get_dataset_names()

In [2]:
df_peng = sns.load_dataset("penguins")
df_iris = sns.load_dataset("iris")
df_mpg = sns.load_dataset("mpg")
df_tita = sns.load_dataset("titanic")

____
# Fri ICA
____

This ICA continues what we did on Wednesday and extends it to dataframe (and series) _manipulations_.

As before, so that you don't get too used to a given dataset, we have opened four above. You have some options:
* each person in your group chooses their own dataset,
* you use as many datasets as there are people in your group, but you all work on them together,
* or, if your group is very large, use at least four datasets. 

____
## Reminder: Tip
____

Pandas is a very large library. Even if I walked you through everything it does, you would have forgotten most of it by the time we reached the end next year.

It is therefore better for you to master the most basic, commonly-used operations and know where to look up more complex data science manipulations. 

Of course, perhaps the best place to start is the `Pandas` [documentation itself](https://pandas.pydata.org/).

Over time, you will find little gems that really organize how to get stuff done with Pandas. One of my favorites is [this website](https://chrisalbon.com/). It is a vast "how-to" guide for data science. 

Bookmark these webpages! 

___
# Doing Stuff: Moving Beyond IDA and EDA
___

## Row and Column Selection


Let's practice basic row and column selections. There are many reasons for these operations, including
* reducing the size of the dataset so that you can do quick operations while you get things to work (then, you use the full dataset for the final result)
* you simply don't want the distraction of columns you will never use,
* some operations don't work on all columns (some of you noticed an issue with color-coding numbers when some columns contained strings),
* and so on....

For this part of the ICA, we assume you know in advance which rows and columns you want. Below, we will use mask-type operations. 

✍️ With your chosen datasets, invent operations you might want to do in the real world. Each person in your group can pick something different. For example, you only want columns with floats; or, you only want the first $4$ columns. Try each of these:
* select a single column,
* select a subset of columns, using a list of column names (the syntax is: `age_sex = df_tita[["age", "sex"]]`)
* columns also have an index, so try both of these approaches:
  - `df.iloc[:, [0, 1, 3]]`
  - `df.iloc[:, 0:3]`

Discuss with your group what these do and how you would use them. Previously, we thought about `iloc` for selecting rows; here it is used for columns. If you don't know this already, a bare `:` in a Python slice means "all". You would read the first line as "_all rows, and the columns at positions 0, 1 and 3_". (Your columns may have more meaningful names, but you can still select them by integer position.)
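The column selections described above can be sketched on a small dataframe. The data and the names `df`, `single`, `subset`, `picked`, `sliced` and `floats` are made up for illustration and are not part of the ICA datasets; `select_dtypes` is one possible way to keep only float columns, not necessarily the intended one.

```python
import pandas as pd

# Toy data, invented for illustration (not one of the seaborn datasets).
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [31, 45, 27],
        "fare": [7.25, 71.28, 8.05],
        "sex": ["female", "male", "male"],
    }
)

single = df["age"]               # one column name -> Series
subset = df[["age", "sex"]]      # list of column names -> DataFrame
picked = df.iloc[:, [0, 1, 3]]   # all rows, columns at positions 0, 1 and 3
sliced = df.iloc[:, 0:3]         # all rows, positions 0, 1, 2 (the end is excluded)
floats = df.select_dtypes(include="float64")  # only float64 columns

print(type(single).__name__, type(subset).__name__)
print(list(picked.columns))
print(list(sliced.columns))
print(list(floats.columns))
```

Printing the column lists is a quick way to check that a selection picked what you expected before you apply it to a full dataset.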

✍️  As you work, be sure to practice the basic Pandas attributes and methods on your new dataframes:
* `.shape`
* `.describe`
* `.head` and `.tail`
* `.index`
* `.values`
* `.dtypes`
* etc.... 

✍️  One important issue to also pay attention to: when you do something to a dataframe, does it change the dataframe itself, or does it produce a new dataframe (and thereby preserve the original)? This is very important to keep track of: you don't want to lose your data by overwriting it, and you also don't want to generate so many new dataframes you lose track of them. Put a discussion of what you find in a markdown cell. 
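One way to investigate this question is to compare the dataframe before and after an operation. The sketch below uses a made-up toy dataframe (the names `df`, `dropped` and `scratch` are hypothetical) and covers only three cases; other operations may behave differently, so check each one you rely on.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Most methods return a NEW dataframe and leave the original untouched.
dropped = df.drop(columns="b")
print(df.shape, dropped.shape)

# Assigning to a column DOES change the dataframe itself.
df["c"] = df["a"] * 2
print(list(df.columns))

# An explicit copy gives you a scratch version that is safe to modify.
scratch = df.copy()
scratch["a"] = 0
print(df["a"].tolist(), scratch["a"].tolist())
```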

✍️ Figure out what this syntax does, discuss within your group and show an example:

`df_tita.iloc[4:14, 1:3]`
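A helpful comparison while working this out is how `iloc` and `loc` treat the end of a slice. The sketch below uses a made-up toy dataframe (all names are hypothetical) with the default integer index, so positions and labels happen to coincide.

```python
import pandas as pd

# Toy data, invented for illustration; the index is the default 0..5.
df = pd.DataFrame(
    {"x": range(6), "y": list("abcdef"), "z": [0.5 * i for i in range(6)]}
)

by_position = df.iloc[1:4, 0:2]   # rows at positions 1, 2, 3; columns 0, 1
by_label = df.loc[1:4, "x":"y"]   # rows labeled 1 through 4, end INCLUDED

print(by_position.shape)
print(by_label.shape)
```

The position-based slice follows the usual Python half-open convention, while the label-based slice includes both endpoints, which is why the two results differ by one row.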

In [3]:

df_peng

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


In [4]:
df_peng.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [5]:
df_peng.tail()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,Male


In [6]:
df_peng.iloc[:, :-4].head()

Unnamed: 0,species,island,bill_length_mm
0,Adelie,Torgersen,39.1
1,Adelie,Torgersen,39.5
2,Adelie,Torgersen,40.3
3,Adelie,Torgersen,
4,Adelie,Torgersen,36.7


In [7]:
df_peng.loc[4:6]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female


In [8]:
df_peng.iloc[:-5, :3].values

array([['Adelie', 'Torgersen', 39.1],
       ['Adelie', 'Torgersen', 39.5],
       ['Adelie', 'Torgersen', 40.3],
       ...,
       ['Gentoo', 'Biscoe', 44.5],
       ['Gentoo', 'Biscoe', 48.8],
       ['Gentoo', 'Biscoe', 47.2]], dtype=object)

In [9]:
df_peng.iloc[:-5, :3].dtypes

species            object
island             object
bill_length_mm    float64
dtype: object

In [10]:
np.array(df_peng.iloc[25:66,:].index)

array([25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
       42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
       59, 60, 61, 62, 63, 64, 65], dtype=int64)

Before we get too far, let's explore various ways we can display a dataframe. We can simply type the name, as in `df_iris`, and press shift-enter. The notebook then shows us the top and bottom of the dataframe, with the middle not shown. We saw above how to see just the top, using `.head()`. We can also see just the bottom using `.tail()`.

✍️ Try using `.tail()` so that you can see what it does in comparison to the two other approaches. 

## [`.where`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html)


It is often the case that we want to separate portions of our data based on the content of a column. We can do this with `where`. 

Read the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html) and learn what `where` can do for you. 


✍️ Here is an example: `df_tita.where(df_tita["sex"] == "female")`. As you learn this, it is best to break it down step-by-step. Print out each step (using any of the datasets; here, this is for the titanic dataset):
* `df_tita["sex"]`,
* `df_tita["sex"] == "female"`,
* `df_tita.where(df_tita["sex"] == "female").dropna()`

In a markdown cell, explain what each of these does.
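The step-by-step breakdown can be tried on a tiny dataframe where every row is visible at once. The data and variable names below are made up for illustration. The last line uses plain boolean indexing, `df[mask]`, which is not part of the exercise but is a common alternative worth comparing against.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame(
    {"sex": ["male", "female", "female", "male"], "age": [22, 38, 26, 35]}
)

mask = df["sex"] == "female"          # boolean Series, one True/False per row
kept_shape = df.where(mask)           # same shape; non-matching rows become NaN
only_rows = df.where(mask).dropna()   # rows containing NaN are removed
filtered = df[mask]                   # boolean indexing selects rows directly

print(mask.tolist())
print(kept_shape.shape, only_rows.shape, filtered.shape)
print(only_rows["age"].dtype, filtered["age"].dtype)
```

Two things to watch for: `.dropna()` also removes rows that already had missing values before `where` was applied, and filling with NaN can change a column's dtype (compare the two dtypes printed on the last line).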

✍️ Next, ask a slightly more complicated question that has two criteria. You will need to adapt this to the datasets you are using, but an example from the penguins dataset might look like:
* `df_peng.where(df_peng["species"] == "Adelie").where(df_peng["sex"] == "Female")`

✍️ `where` is much more useful than the two previous examples suggest. Work through this example and try the idea on your dataset:
`mask_1 = df_mpg["origin"] == "usa"`

`mask_2 = df_mpg["cylinders"] == 8`

`df_mpg.where(mask_1 & mask_2, "weak!!")`

What is this doing? Explain in a markdown cell. 
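To see the effect of combining masks and supplying a replacement value, it can help to run the same pattern on a dataframe small enough to read in full. The sketch below uses made-up toy data and a hypothetical replacement string `"n/a"`; the column names mirror the mpg example only for familiarity.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame(
    {
        "origin": ["usa", "japan", "usa", "europe"],
        "cylinders": [8, 4, 6, 8],
    }
)

mask_1 = df["origin"] == "usa"
mask_2 = df["cylinders"] == 8
both = mask_1 & mask_2   # element-wise AND: True only where both are True

# Rows where `both` is False have every entry replaced by the second argument.
replaced = df.where(both, "n/a")

print(both.tolist())
print(replaced)
```

Note that the original `df` is unchanged; `where` returns a new dataframe.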

In [11]:
df_peng.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [12]:
df_peng.where(df_peng['body_mass_g'] > 3500).dropna().head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,Female
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,Male


## [`.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

`Pandas` also offers powerful manipulations similar to what you find with databases. One such manipulation is `groupby`. This manipulation is very powerful, and we will explore it throughout the rest of the semester. 

To keep things simple for this ICA, we'll just start with one simple idea, which is to group the data by a value in a column and find group properties. 


✍️ Discuss with your group what this does, and then apply the idea to your dataset:

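`df_mpg.groupby("origin").mean(numeric_only=True)`

You can also try:

`df_mpg.groupby("origin").std(numeric_only=True)`

(On pandas 2.0 and later, `numeric_only=True` is needed here because the string column `name` cannot be averaged.)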

In a markdown cell, describe what this is doing. 
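The grouping idea can be checked by hand on a very small dataframe. The sketch below uses made-up toy data; the column names echo the mpg dataset, but the values are invented.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame(
    {
        "origin": ["usa", "usa", "japan", "japan"],
        "mpg": [18.0, 22.0, 30.0, 34.0],
        "name": ["car a", "car b", "car c", "car d"],
    }
)

# One row per distinct value of "origin"; numeric_only=True skips the
# string column "name", which has no meaningful mean.
means = df.groupby("origin").mean(numeric_only=True)

# Selecting a single column before aggregating returns a Series.
spread = df.groupby("origin")["mpg"].std()

print(means)
print(spread)
```

With only two rows per group, you can verify each mean with mental arithmetic, which is a useful sanity check before trusting the result on a larger dataset.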

In [13]:
df_peng.dropna().groupby('sex').body_mass_g.mean()

sex
Female    3862.272727
Male      4545.684524
Name: body_mass_g, dtype: float64

**This first drops every row that has a missing value, then groups the remaining penguins by sex and computes the mean body mass (in grams) of each group: female and male.**

___
#### Congrats, you are done! Submit to D2L!