# Intro to Data Science



---
<img src="https://calnerds.berkeley.edu/css/images/logo.jpg"  /> <!--style="width: 500px; height: 275px;"-->




### Table of Contents


1 - [Installing Libraries](#section1)<br>

2 - [Data Frames](#section2)<br>


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Importing Data & Summary Statistics](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Indexing &  Slicing ](#subsection2)<br>


---
## Installing Libraries <a id='section1'></a>


In [10]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

---
## Data Frames <a id='section2'></a>

### 2.1 Importing Data & Summary Statistics  <a id='subsection1'></a>

We will use the `read_csv()` function from the _pandas_ library to import and read our data. The _csv_ in the function name stands for comma-separated values, so by default the function expects a comma-delimited file. Files can use other delimiters, such as tabs, semicolons, or pipes; in that case, pass the delimiter through the `sep` argument.
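
As a minimal sketch of reading a file that is not comma-delimited, the snippet below parses semicolon-separated text with the `sep` argument of `read_csv()`. The text and the column names are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk.
raw_text = "name;score\nana;3.5\nben;4.0\n"

# `sep` tells read_csv which character separates the columns.
scores = pd.read_csv(io.StringIO(raw_text), sep=";")
print(scores)
```

For a tab-delimited file, the same call would use `sep="\t"`.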

We will now read the _iris.csv_ file as a **DataFrame** and store it in a variable called _iris_.

In [22]:
iris = pd.read_csv('../data/iris.csv')

Great! Now let's explore our data set. 

We will begin with the method `.head()`. By default, it shows the first 5 rows of our data set, but you can display the first _n_ rows by passing _n_ as an argument to `.head()`.

In [23]:
iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


You can also see the last _n_ rows of our data using the method `.tail()`.

In [24]:
iris.tail()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica
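
To illustrate passing _n_ to these methods, here is a small sketch on a made-up DataFrame (the names `toy`, `x`, and `y` are placeholders, not part of the iris data).

```python
import pandas as pd

# Small made-up DataFrame; the column names are placeholders, not the iris data.
toy = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

first_three = toy.head(3)   # rows at positions 0, 1, 2
last_two = toy.tail(2)      # rows at positions 8, 9
print(first_three)
print(last_two)
```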


A `DataFrame` contains rows and columns. If you want to understand the structure of your DataFrame, there are a few functions and attributes that might come in handy.

These include
* `shape`
* `columns`
* `index`
* `info()`
* `describe()`
* `len()`

In [25]:
iris.shape

(150, 5)

The iris DataFrame contains 150 rows and 5 columns.

In [28]:
iris.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

In [32]:
iris.index

RangeIndex(start=0, stop=150, step=1)

In [35]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


As with lists and arrays, you can also use the function `len()`. For a DataFrame, it returns the number of rows.

In [26]:
len(iris)

150

Another useful method is `.describe()`. It provides basic statistics about each of the variables in your DataFrame, including measures of central tendency, dispersion, and the shape of the distribution, excluding **NaN** values.
* By default, it returns summary statistics for the numeric columns only, but it can also work with mixed data. When called on string columns, it returns measures such as the count, the number of unique values, and the most frequent value.

In [27]:
iris.describe()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
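
The behavior on text columns can be sketched with a small made-up DataFrame (the names `toy`, `length`, and `kind` are placeholders). The `include="all"` argument asks `describe()` to summarize every column rather than only the numeric ones.

```python
import pandas as pd

toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "kind": ["a", "b", "a", "a"],
})

# Calling describe() on a text column reports count, unique, top and freq.
kind_summary = toy["kind"].describe()
print(kind_summary)

# include="all" summarizes numeric and text columns together;
# statistics that do not apply to a column are shown as NaN.
full_summary = toy.describe(include="all")
print(full_summary)
```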


### 2.2 Indexing & Slicing  <a id='subsection2'></a>

#### .loc[rows-label(s),columns-label(s)]
`.loc` helps us view and index our DataFrame by label.
* It works with labels. Most of the time your columns will have specific names, but row labels are often just numbers, as they are here.
* It can take
    * one label __(df.loc[row-label, 'col-label-1'])__
    * a list of labels __(df.loc[[row-label-1, row-label-2, row-label-4], ['col-label-1', 'col-label-2', 'col-label-4']])__
    * or a _slice_ of labels __(df.loc[row-label-50 : row-label-100, 'col-label-1' : 'col-label-8'])__


#### Rows

Let's use `loc` to see the values in row 10 of our DataFrame.

In [21]:
iris.loc[10]

sepal.length       5.4
sepal.width        3.7
petal.length       1.5
petal.width        0.2
variety         Setosa
Name: 10, dtype: object

* _Notice that if our rows were labeled with text, we would have to use that label instead of 10. Here the rows are labeled with integers starting at 0, so the label 10 refers to the 11th row._
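
To see what label-based indexing looks like when rows carry text labels, here is a minimal sketch on a made-up DataFrame (the `plants` data and its labels are hypothetical).

```python
import pandas as pd

# Made-up measurements indexed by text labels instead of integers.
plants = pd.DataFrame(
    {"height": [10.5, 7.2, 12.1], "color": ["red", "blue", "white"]},
    index=["rose", "iris", "lily"],
)

one_row = plants.loc["iris"]               # single row label -> Series
one_value = plants.loc["iris", "height"]   # row label + column label -> single value
print(one_row)
print(one_value)
```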

What if we want to see the values in rows 5, 10, and 15? Let's pass 5, 10, and 15 into `loc` as a list of values.


In [22]:
iris.loc[[5,10,15]]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
5,5.4,3.9,1.7,0.4,Setosa
10,5.4,3.7,1.5,0.2,Setosa
15,5.7,4.4,1.5,0.4,Setosa


This returned a `DataFrame`, whereas the first call returned a `Series`. This is because here we passed a list of labels rather than a single label.
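
The difference in return type can be checked directly. The sketch below uses a made-up DataFrame named `toy` and compares a single label with a one-element list.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

single = toy.loc[1]      # one label -> Series
as_list = toy.loc[[1]]   # list with one label -> DataFrame with one row
print(type(single).__name__)
print(type(as_list).__name__)
```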

How would you use `loc` to see the values of rows 10 to 20? You could use a list as in the example above, but typing every number from 10 to 20 is cumbersome. A better way is slicing, just as we did with arrays and lists. Note that, unlike list slicing, a `loc` slice includes both endpoints.

In [23]:
iris.loc[10:20]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
10,5.4,3.7,1.5,0.2,Setosa
11,4.8,3.4,1.6,0.2,Setosa
12,4.8,3.0,1.4,0.1,Setosa
13,4.3,3.0,1.1,0.1,Setosa
14,5.8,4.0,1.2,0.2,Setosa
15,5.7,4.4,1.5,0.4,Setosa
16,5.4,3.9,1.3,0.4,Setosa
17,5.1,3.5,1.4,0.3,Setosa
18,5.7,3.8,1.7,0.3,Setosa
19,5.1,3.8,1.5,0.3,Setosa
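
One detail worth checking is how slice endpoints are treated. In this small sketch on a made-up DataFrame (`toy` is a placeholder), the label-based slice keeps the stop label, while the position-based slice stops before it.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(30)})

by_label = toy.loc[10:20]      # label-based: includes the row labeled 20
by_position = toy.iloc[10:20]  # position-based: stops before position 20
print(len(by_label), len(by_position))
```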


#### Columns 

Great! Now that you know how to index rows, let's see how we can index columns. Don't forget that we are still using `loc`, so we will have to use column labels.

Let's begin by indexing by one column, variety.

In [26]:
iris.loc[:,'variety']

0         Setosa
1         Setosa
2         Setosa
3         Setosa
4         Setosa
         ...    
145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: variety, Length: 150, dtype: object

Another way to index by only one column is by adding the column label in a list. 

In [31]:
iris.loc[:,['variety']]

Unnamed: 0,variety
0,Setosa
1,Setosa
2,Setosa
3,Setosa
4,Setosa
...,...
145,Virginica
146,Virginica
147,Virginica
148,Virginica


The difference between these two is that the first returned a `Series` because we selected a single label, and the second returned an n×1 `DataFrame` because we passed a list.

Notice that here we had to specify which rows we want from that column. We used `:` to return all rows.

Now, let's index by more than one column. Just as before we will use a list containing our desired column labels. 

In [28]:
iris.loc[:,['sepal.length', 'sepal.width','variety']]

Unnamed: 0,sepal.length,sepal.width,variety
0,5.1,3.5,Setosa
1,4.9,3.0,Setosa
2,4.7,3.2,Setosa
3,4.6,3.1,Setosa
4,5.0,3.6,Setosa
...,...,...,...
145,6.7,3.0,Virginica
146,6.3,2.5,Virginica
147,6.5,3.0,Virginica
148,6.2,3.4,Virginica


Just as we sliced rows, we can do the same with columns.

In [30]:
iris.loc[:,'sepal.length': 'petal.width']

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3
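
Row and column selections can also be combined in a single `loc` call. The following sketch uses a made-up DataFrame (the names `toy`, `length`, `width`, and `kind` are placeholders).

```python
import pandas as pd

toy = pd.DataFrame({
    "length": [5.1, 4.9, 4.7, 4.6],
    "width": [3.5, 3.0, 3.2, 3.1],
    "kind": ["a", "a", "b", "b"],
})

# Rows labeled 1 through 2 (inclusive), columns from "length" through "width".
block = toy.loc[1:2, "length":"width"]

# A list of row labels paired with a list of column labels.
picked = toy.loc[[0, 3], ["length", "kind"]]
print(block)
print(picked)
```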


#### .iloc[rows_index,columns_index]

Another way to index is using `.iloc`. `iloc` allows us to index using integer positions. 
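
When the row labels happen to be 0, 1, 2, ..., labels and positions coincide, so `loc` and `iloc` can look interchangeable. A small sketch with a made-up DataFrame whose labels are in reverse order shows where they differ.

```python
import pandas as pd

# Row labels deliberately do not match row positions.
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[3, 2, 1, 0])

by_label = toy.loc[0, "value"]   # the row whose label is 0 (the last row)
by_position = toy.iloc[0, 0]     # the row at position 0 (the first row)
print(by_label, by_position)
```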


#### Rows

In [32]:
iris.iloc[[1,3,6,8,9]]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
1,4.9,3.0,1.4,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
6,4.6,3.4,1.4,0.3,Setosa
8,4.4,2.9,1.4,0.2,Setosa
9,4.9,3.1,1.5,0.1,Setosa


Recall __start:stop:step__ from lists? We can also select a range of rows with a specified step in our DataFrame. Here we take every 5th row, starting at position 50 and stopping before position 150.

In [34]:
iris.iloc[50:150:5]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
50,7.0,3.2,4.7,1.4,Versicolor
55,5.7,2.8,4.5,1.3,Versicolor
60,5.0,2.0,3.5,1.0,Versicolor
65,6.7,3.1,4.4,1.4,Versicolor
70,5.9,3.2,4.8,1.8,Versicolor
75,6.6,3.0,4.4,1.4,Versicolor
80,5.5,2.4,3.8,1.1,Versicolor
85,6.0,3.4,4.5,1.6,Versicolor
90,5.5,2.6,4.4,1.2,Versicolor
95,5.7,3.0,4.2,1.2,Versicolor
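
The same __start:stop:step__ pattern can be tried on a smaller made-up DataFrame, where it is easy to see that the stop position itself is left out.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(20)})

# Positions 5, 10 and 15; the stop position (20) is excluded.
every_fifth = toy.iloc[5:20:5]
print(list(every_fifth.index))
```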


#### Columns

`iloc` works just like `loc`, but instead of labels we use integer positions. Let's get all the rows in the fifth column. Don't forget that positions start at 0, so the fifth column is at position 4.

In [38]:
iris.iloc[:,4]

0         Setosa
1         Setosa
2         Setosa
3         Setosa
4         Setosa
         ...    
145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: variety, Length: 150, dtype: object
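
A few other position-based column selections are sketched below on a made-up DataFrame (the names are placeholders): a negative position, a list of positions, and row and column slices together.

```python
import pandas as pd

toy = pd.DataFrame({
    "length": [5.1, 4.9, 4.7],
    "width": [3.5, 3.0, 3.2],
    "kind": ["a", "b", "c"],
})

last_col = toy.iloc[:, -1]       # negative positions count from the end
two_cols = toy.iloc[:, [0, 2]]   # list of positions -> DataFrame
corner = toy.iloc[0:2, 0:2]      # first two rows and first two columns
print(last_col.name)
print(list(two_cols.columns))
print(corner)
```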

---
Notebook developed by: Kseniya Usovich & Karla Palos

Cal NERDS GitHub: https://github.com/Cal-NERDS
