# Intro to Data Science



---
<img src="https://calnerds.berkeley.edu/css/images/logo.jpg"  /> <!--style="width: 500px; height: 275px;"-->




### Table of Contents

1 - [Manipulating Columns ](#section1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Uniqueness](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Frequencies](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Sorting](#subsection3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 - [Min, Max, Range](#subsection4)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.5 - [Missing Values](#subsection5)<br>


2 - [Booleans & Boolean Indexing](#section2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Booleans](#subsection6)<br>


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Boolean Indexing](#subsection7)<br>



---
## Data Frames 


In [2]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt


Let's begin by reading a new data set called "cereal". We will use `pd.read_csv` just as we did before. You can learn more about the data set [here]().

In [3]:
cereal = pd.read_csv('../data/cereal.csv')

In [4]:
cereal

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.00,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.50,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Triples,G,C,110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
73,Trix,G,C,110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.00,27.753301
74,Wheat Chex,R,C,100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
75,Wheaties,G,C,100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.00,51.592193


### Manipulating Columns
 <a id='Section1'></a>

####  Uniqueness

Suppose that we want to find out the number of unique manufacturers in our data. The `.unique()` method allows us to check this. 

There are two ways to select the column: "dot" notation and bracket notation. For the most part, we will stick to the second method, since dot notation makes it easy to run into errors (for example, when a column name contains a space).

1) __df.column_label.unique()__

2) __df['column_label'].unique()__
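As a rough illustration of why bracket notation is the safer habit, here is a minimal sketch on a small made-up table. The DataFrame `df` and its column names are hypothetical and are not part of the cereal data; they were chosen only to show two cases where dot notation does not behave as you might expect.

```python
import pandas as pd

# Hypothetical toy table (not the cereal data set).
df = pd.DataFrame({
    "mfr": ["K", "G", "K"],
    "serving size": [1.0, 0.75, 1.0],  # column name contains a space
    "count": [3, 5, 7],                # column name matches a DataFrame method
})

# For a simple column name, both notations give the same result.
same = list(df.mfr.unique()) == list(df["mfr"].unique())

# Bracket notation works even when the name contains a space;
# df.serving size would be a syntax error.
sizes = df["serving size"].unique()

# df.count refers to the built-in DataFrame method, not our column.
dot_is_method = callable(df.count)
bracket_is_column = isinstance(df["count"], pd.Series)

print(same, sizes, dot_is_method, bracket_is_column)
```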



In [5]:
print('There are ',cereal['mfr'].nunique() ,'unique manufacturers')
print('These are: ', cereal['mfr'].unique())

There are  7 unique manufacturers
These are:  ['N' 'Q' 'K' 'R' 'G' 'P' 'A']


Notice that we used the method `.nunique()` to tell us how _many_ unique items we have rather than which items. An alternative way to compute this is __len(cereal['mfr'].unique())__.
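To see that counting the unique values directly and taking the length of the array of unique values agree, here is a small sketch. The Series `mfr` is a made-up example, not the actual column from the cereal data.

```python
import pandas as pd

# Hypothetical manufacturer codes (not the real cereal column).
mfr = pd.Series(["N", "Q", "K", "K", "R", "G", "K"])

n_direct = mfr.nunique()        # number of distinct values
n_via_len = len(mfr.unique())   # length of the array of distinct values

print(n_direct, n_via_len)
```

One caveat: by default `.nunique()` ignores missing values, while `.unique()` includes them, so the two counts can differ when a column has missing data.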

#### Frequencies

More specifically, say we want to know how many cereals exist per manufacturer. In this case, we would use the `.value_counts()` method instead. This method returns the counts for the unique values in our column. 

In [6]:
cereal['mfr'].value_counts()

K    23
G    22
P     9
R     8
Q     8
N     6
A     1
Name: mfr, dtype: int64

Notice that this method sorts the counts in decreasing order. What if you wanted a different ordering? Maybe you want to sort by index, that is, in alphabetical order. In this case you would use the `.sort_index()` method, as seen below. 

#### Sorting

In [7]:
cereal['mfr'].value_counts().sort_index()

A     1
G    22
K    23
N     6
P     9
Q     8
R     8
Name: mfr, dtype: int64

If instead you wanted to sort by counts, but in ascending order, you can use the `.sort_values()` method with the argument __ascending=True__.

In [8]:
cereal['mfr'].value_counts().sort_values(ascending=True)

A     1
N     6
R     8
Q     8
P     9
G    22
K    23
Name: mfr, dtype: int64

#### Min, Max, & Range

Say that for our analysis we want to understand our cereals by the rating feature.

A good starting point might be to see what the __min__ and the __max__ are for our data. We can do this by using the methods `.min()` and `.max()`, respectively. 

In [9]:
print('Min rating is :',cereal['rating'].min())

print('Max rating is :',cereal['rating'].max())

Min rating is : 18.042851000000002
Max rating is : 93.704912


To get the range, all you need to do is subtract the min from the max!

Tip: Create a variable for the max and the min so that you don't have to spend time rewriting your code! If you don't remember how to do this, go back to Lesson 1.

Bonus: Use the function `round` to round these two numbers to a fixed number of decimal places. 
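One possible way to follow the tip is sketched below on a small made-up Series, so that you can still try it yourself on the cereal ratings. The values and the variable names (`ratings`, `rating_min`, `rating_max`, `rating_range`) are illustrative assumptions, and rounding to 2 decimal places is just one choice.

```python
import pandas as pd

# Hypothetical ratings (not the real cereal ratings).
ratings = pd.Series([20.5, 35.25, 90.128])

# Store the min and max in variables so they can be reused.
rating_min = ratings.min()
rating_max = ratings.max()

# The range is the max minus the min.
rating_range = rating_max - rating_min

# Round to 2 decimal places (the number of places is our choice here).
rounded_range = round(rating_range, 2)

print(rating_min, rating_max, rounded_range)
```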

#### Missing Values

A common problem that you will come across when analyzing data is __missing__ data. You can check whether your data set contains missing values by using the method `.isnull()`. This method returns True wherever a value is missing and False wherever it is not. We can combine it with `.sum()` to count the missing values in each column.

__Note:__ In Python (as in many programming languages), True is treated as 1 and False as 0 in arithmetic. Using `.sum()` therefore counts the True values, which here are the missing entries. 
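To make the True/False arithmetic concrete, here is a minimal sketch. The Series `sugars` is a made-up column with two missing entries; it is not taken from the cereal data.

```python
import numpy as np
import pandas as pd

# In plain Python, True counts as 1 and False as 0.
print(True + True + False)

# Hypothetical column with two missing values (np.nan).
sugars = pd.Series([6.0, np.nan, 5.0, np.nan])

mask = sugars.isnull()    # True where the value is missing
n_missing = mask.sum()    # adding up the Trues counts the missing values

print(list(mask), n_missing)
```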

In [10]:
cereal.isnull().sum()

name        0
mfr         0
type        0
calories    0
protein     0
fat         0
sodium      0
fiber       0
carbo       0
sugars      0
potass      0
vitamins    0
shelf       0
weight      0
cups        0
rating      0
dtype: int64

Notice that in the example above we checked the number of missing values in each of the columns. What if you only wanted to check one column? You can use the same notations we discussed earlier, that is, bracket and dot notation.

In [11]:
cereal['rating'].isnull().sum()

0

#### Groupby 

Now, say that we want to find the average number of calories for the cereals per manufacturer. We can use an operation called `.groupby()`. `.groupby()` involves a combination of three steps: splitting the data into groups, applying a function to each group (for example `.sum()`, `.mean()`, or `.count()`), and combining the results. 


In [12]:
cereal.groupby("mfr")['calories'].mean()

mfr
A    100.000000
G    111.363636
K    108.695652
N     86.666667
P    108.888889
Q     95.000000
R    115.000000
Name: calories, dtype: float64

Let's break down what happened above. We begin with a DataFrame (`cereal`) and tell pandas (our library) to group by a column (`mfr`). Then we specify which column (`calories`) we want to apply our desired operation (`mean`) to.
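To see the split, apply, and combine steps separately, here is a rough sketch that does them by hand and compares the result with `.groupby()`. The DataFrame `toy` is a made-up example, and the loop is only an illustration of the idea, not how pandas works internally.

```python
import pandas as pd

# Hypothetical toy table (not the cereal data set).
toy = pd.DataFrame({
    "mfr": ["K", "G", "K", "G", "N"],
    "calories": [70, 110, 50, 100, 80],
})

# Split, apply, and combine by hand.
manual = {}
for mfr in sorted(toy["mfr"].unique()):
    group = toy[toy["mfr"] == mfr]           # split: rows for one manufacturer
    manual[mfr] = group["calories"].mean()   # apply: average calories in that group
# combine: the dictionary now holds one result per manufacturer

# The same computation in one line with groupby.
grouped = toy.groupby("mfr")["calories"].mean()

print(manual)
print(grouped)
```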

In [13]:
cereal.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


You can also group by more than one column! You just need to add the columns in a list. 

For instance, let's get the number of cereals by type and manufacturer.

In [14]:
cereal.groupby(["type", "mfr"])['name'].count()

type  mfr
C     G      22
      K      23
      N       5
      P       9
      Q       7
      R       8
H     A       1
      N       1
      Q       1
Name: name, dtype: int64

The groupby above returned a `Series`. If instead you would like to get your data back as a `DataFrame`, you have to use an additional pair of brackets around the column that you are calling the action on, in this case `name`.

In [15]:
cereal.groupby(["type", "mfr"])[['name']].count()

type,mfr,name
C,G,22
C,K,23
C,N,5
C,P,9
C,Q,7
C,R,8
H,A,1
H,N,1
H,Q,1
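A quick way to check the difference between single and double brackets is to look at the type of each result. The sketch below uses a made-up table `toy`; its rows are hypothetical and are not the cereal data.

```python
import pandas as pd

# Hypothetical toy table (not the cereal data set).
toy = pd.DataFrame({
    "type": ["C", "C", "H"],
    "mfr": ["K", "K", "A"],
    "name": ["Cereal 1", "Cereal 2", "Cereal 3"],
})

# Single brackets around the column name give a Series.
as_series = toy.groupby(["type", "mfr"])["name"].count()

# Double brackets (a list with one column name) give a DataFrame.
as_frame = toy.groupby(["type", "mfr"])[["name"]].count()

print(type(as_series))
print(type(as_frame))
```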


### Booleans
 <a id='section2'></a>

### Boolean Indexing
 <a id='section3'></a>

Suppose we only want to look at the cereals that have fewer than 100 calories. We will use __Boolean indexing__ to create a DataFrame that meets this criterion. 

We will accomplish this by:
1. Selecting the `calories` column from the DataFrame. 
2. Creating an array of Booleans in which each value is True if the corresponding value in `calories` is less than 100, and False otherwise. To do this, apply a comparison operator such as `<`, `>`, `<=`, `>=`, `==`, or `!=` to the column. 
3. Using the array of Booleans inside the DataFrame's brackets to keep only the rows where the value is True.

In [19]:
cereal[cereal['calories']<100]

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
8,Bran Chex,R,C,90,2,1,200,4.0,15.0,6,125,25,1,1.0,0.67,49.120253
9,Bran Flakes,P,C,90,3,0,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813
50,Nutri-grain Wheat,K,C,90,3,0,170,3.0,18.0,2,90,25,3,1.0,1.0,59.642837
54,Puffed Rice,Q,C,50,1,0,0,0.0,13.0,0,15,0,3,0.5,1.0,60.756112
55,Puffed Wheat,Q,C,50,2,0,0,1.0,10.0,0,50,0,3,0.5,1.0,63.005645
60,Raisin Squares,K,C,90,2,0,0,2.0,15.0,6,110,25,3,1.0,0.5,55.333142
63,Shredded Wheat,N,C,80,2,0,0,3.0,16.0,0,95,0,1,0.83,1.0,68.235885
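The one-line filter above can also be written out step by step. Here is a minimal sketch on a small made-up table; the DataFrame `toy` and the variable names `calories`, `mask`, and `low_cal` are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical toy table (not the cereal data set).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "calories": [70, 120, 90, 100],
})

# Step 1: select the column.
calories = toy["calories"]

# Step 2: compare each value to 100 to get a Series of Booleans.
mask = calories < 100

# Step 3: use the Booleans inside the brackets to keep only the True rows.
low_cal = toy[mask]

print(list(mask))
print(low_cal)
```

Note that a cereal with exactly 100 calories is left out, because `<` is a strict comparison; use `<=` if you want to include it.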


---
Notebook developed by: Kseniya Usovich & Karla Palos

Cal NERDS GitHub: https://github.com/Cal-NERDS
