# Why do we need pandas?

**Pandas** is a package designed for working with data frames. Data frames are 2-dimensional data structures that are among the most commonly used in data analytics, especially when you work with tabular data. Pandas lets you perform many kinds of data transformations:

* read data of various formats into Python: flat files (CSV and delimited), Excel files, databases, etc.
* clean your data: handle missing values (*NaN*s), convert data types, etc.
* change the dimensionality of your data: insert and delete columns and rows, etc.
* organize the data in the most efficient way for your analysis: manipulate the index column, label observations, etc.
* perform split-apply-combine operations on data frames: aggregate data, explore summary statistics on multiple levels, etc.
* convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects: convert arrays and dictionaries into data frames, etc.
* slice and subset data using labels and indexes
* merge and join multiple data frames
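
To make two of the items above more concrete, here is a minimal sketch of a split-apply-combine step followed by a merge. The frames `visits` and `countries` and all their values are made up for illustration; they are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
visits = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "visitors": [10, 30, 20, 40],
})
countries = pd.DataFrame({
    "city": ["Oslo", "Rome"],
    "country": ["Norway", "Italy"],
})

# Split-apply-combine: split rows by city, average each group, combine the results.
mean_visitors = visits.groupby("city", as_index=False)["visitors"].mean()

# Merge: attach the country of each city by joining on the shared "city" column.
summary = mean_visitors.merge(countries, on="city", how="left")
print(summary)
```

Both steps return new data frames, so the original `visits` and `countries` stay unchanged.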

See the official documentation and description here: https://pandas.pydata.org/docs/pandas.pdf.

In [57]:
#! pip install numpy
#! pip install pandas
import numpy as np #this alias is a convention
import pandas as pd #this alias is a convention

Changing display options matters little when you work with relatively small data frames, but it is useful with larger ones. Options control how output is formatted: for example, how many rows and columns are printed and how many decimal places are shown. The settings below are just examples.

In [58]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 2)
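
Options set with `pd.set_option` stay in effect for the rest of the session. As a hedged sketch, the snippet below shows one way to inspect an option and to override options only temporarily with `pd.option_context`; the specific values 5 and 3 are arbitrary choices for illustration.

```python
import pandas as pd

# Read the current value of an option.
default_rows = pd.get_option("display.max_rows")

# Temporarily override options; they revert when the block ends.
with pd.option_context("display.max_rows", 5, "display.precision", 3):
    inside_rows = pd.get_option("display.max_rows")

restored_rows = pd.get_option("display.max_rows")

# Restore a single option to its library default.
pd.reset_option("display.precision")
```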

## Series and DataFrames

Series are 1-dimensional structures or, in essence, vectors of values of a certain type. They are the fundamental building blocks of a DataFrame.

Create an example Series object with pd.Series() and print it out.

In [59]:
my_Series = pd.Series(np.random.randn(6), index=['a', 'b', 'c', 'd', 'e', 'f'])
my_Series

a    0.18
b   -1.08
c    1.30
d    0.48
e    0.60
f    0.83
dtype: float64
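
A minimal sketch of how the index of a Series can be used, assuming a small hypothetical Series `s` with fixed values so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical values chosen for illustration.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.loc["b"]      # label-based lookup
by_position = s.iloc[0]    # position-based lookup
doubled = s * 2            # arithmetic applies element-wise and keeps the index
```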

A DataFrame is a 2-dimensional structure: a collection of Series of the same length that describe features of a set of observations.

You already figured out how to create an array of random numbers of a certain shape.

In [60]:
array_of_random_numbers = np.random.rand(4, 2)  # rand already returns a NumPy array
array_of_random_numbers

array([[0.73337551, 0.37010102],
       [0.2405627 , 0.50079573],
       [0.5093312 , 0.58684981],
       [0.67132398, 0.6109029 ]])

Now, use the pd.DataFrame() function to convert this array to a data frame. See what the *columns* argument does and use it to name the columns of your data frame.

In [61]:
my_data_frame =  pd.DataFrame(array_of_random_numbers, columns=["trial 1", "trial 2"], index=['a', 'b', 'c', 'd'])
my_data_frame

   trial 1  trial 2
a     0.73     0.37
b     0.24     0.50
c     0.51     0.59
d     0.67     0.61
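
A DataFrame can also be assembled from Series directly. The sketch below, using two hypothetical Series with partly different labels, suggests how pandas aligns them on the index and marks the gaps as missing values:

```python
import pandas as pd

# Hypothetical Series with partly different labels.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# Columns are aligned on the union of the labels; gaps become NaN.
df = pd.DataFrame({"first": s1, "second": s2})

missing = df["second"].isna().sum()   # "a" has no value in s2
filled = df.fillna(0)                 # one simple way to handle the gap
```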


It is a common approach in pandas to create a data frame from a dictionary. You already know how to create a dictionary.

In [62]:
my_dictionary =  {
    "username": "user21243412",
    "followers": "12",
    "following": "243",
}
my_dictionary

{'username': 'user21243412', 'followers': '12', 'following': '243'}

Let's convert our dictionary to a data frame by passing it to the **pd.DataFrame()** function. Because every value in the dictionary is a scalar, pandas cannot infer the number of rows, so make sure you pass **index=[0]** as an argument.

In [63]:
my_data_frame_2 = pd.DataFrame(my_dictionary, index=[0])
my_data_frame_2

       username followers following
0  user21243412        12       243
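
The following sketch illustrates why an index is needed for a dictionary of scalars and shows one alternative. The names `record`, `df_index` and `df_list` are hypothetical, and the final type conversion is an optional extra step, not something the cell above does.

```python
import pandas as pd

record = {"username": "user21243412", "followers": "12", "following": "243"}

# With only scalar values pandas cannot infer how many rows to build.
try:
    pd.DataFrame(record)
    raised = False
except ValueError:
    raised = True

# Option 1: pass an explicit index, as in the cell above.
df_index = pd.DataFrame(record, index=[0])

# Option 2: wrap the dictionary in a list, one dictionary per row.
df_list = pd.DataFrame([record])

# The counts arrive as strings; convert them to integers before doing arithmetic.
df_list = df_list.astype({"followers": "int64", "following": "int64"})
```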


Nice! Now, let's build on the idea of creating data frames from a dictionary. One of the wonderful facts about data frames is that each column can hold a different data type. This universality helps us work with diverse features all at once.

In [88]:
my_data_frame_3 = pd.DataFrame({
    'column A': [1, 2, 3, 4],
    'column B': pd.Timestamp('20200725'),
    'column C': pd.Series(100, index=list(range(4)), dtype='float32'),
    'column D': np.array([22] * 4, dtype='int32'),
    'column E': pd.Categorical(["sun", "rain", "sun", "rain"]),
    'column F': 'verified'
})
my_data_frame_3

   column A    column B  column C  column D  column E  column F
0         1  2020-07-25     100.0        22       sun  verified
1         2  2020-07-25     100.0        22      rain  verified
2         3  2020-07-25     100.0        22       sun  verified
3         4  2020-07-25     100.0        22      rain  verified


Run the cell below to get information on my_data_frame_3. What can you infer about the data frame? What types of variables does it contain?

In [65]:
my_data_frame_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   column A  4 non-null      int64         
 1   column B  4 non-null      datetime64[ns]
 2   column C  4 non-null      float32       
 3   column D  4 non-null      int32         
 4   column E  4 non-null      category      
 5   column F  4 non-null      object        
dtypes: category(1), datetime64[ns](1), float32(1), int32(1), int64(1), object(1)
memory usage: 260.0+ bytes


## Basic commands in pandas

Use the toy data frame **my_data_frame_3** created in the previous section. Add more cells to the notebook in this section and see what these methods and attributes tell you about the data frame.

* .head() - use n=2 as an argument 
* .tail() - use n=2 as an argument
* .shape
* .columns
* .values
* .dtypes

In [66]:
my_data_frame_3.head(n=2)

   column A    column B  column C  column D  column E  column F
0         1  2020-07-25     100.0        22       sun  verified
1         2  2020-07-25     100.0        22      rain  verified


In [67]:
my_data_frame_3.tail(n=2)

   column A    column B  column C  column D  column E  column F
2         3  2020-07-25     100.0        22       sun  verified
3         4  2020-07-25     100.0        22      rain  verified


In [68]:
my_data_frame_3.shape

(4, 6)

In [69]:
my_data_frame_3.columns

Index(['column A', 'column B', 'column C', 'column D', 'column E', 'column F'], dtype='object')

In [70]:
my_data_frame_3.values

array([[1, Timestamp('2020-07-25 00:00:00'), 100.0, 22, 'sun',
        'verified'],
       [2, Timestamp('2020-07-25 00:00:00'), 100.0, 22, 'rain',
        'verified'],
       [3, Timestamp('2020-07-25 00:00:00'), 100.0, 22, 'sun',
        'verified'],
       [4, Timestamp('2020-07-25 00:00:00'), 100.0, 22, 'rain',
        'verified']], dtype=object)

In [71]:
my_data_frame_3.dtypes

column A             int64
column B    datetime64[ns]
column C           float32
column D             int32
column E          category
column F            object
dtype: object
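
As a small, hedged illustration of how `.shape` and the column dtypes can be put to use, the sketch below builds a hypothetical frame `df` that mirrors a few columns of the toy data frame and picks out columns by type with `select_dtypes`:

```python
import numpy as np
import pandas as pd

# A hypothetical frame that mirrors a few columns of the toy data frame.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "D": np.array([22] * 4, dtype="int32"),
    "E": pd.Categorical(["sun", "rain", "sun", "rain"]),
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

# Select column names by the kind of data they hold.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
category_cols = df.select_dtypes(include="category").columns.tolist()
```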

Examine what the **sample()** method does. Can you sample 3 random rows from our toy data frame?

In [90]:
my_data_frame_3.sample(n=3, random_state=2)

   column A    column B  column C  column D  column E  column F
2         3  2020-07-25     100.0        22       sun  verified
3         4  2020-07-25     100.0        22      rain  verified
1         2  2020-07-25     100.0        22      rain  verified


In [87]:
my_data_frame_3.sample(n=1)

   column A    column B  column C  column D  column E  column F
0         1  2020-07-25     100.0        22       sun  verified


What does **random_state** do? What happens when you set it and when you do not?

**random_state** takes an integer that seeds the random number generator used to pick the rows. With the same seed, **sample()** returns the same rows on every run; without it, each call can return a different selection.
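
A minimal sketch of the effect of the seed, assuming a hypothetical ten-row frame `df` so that there is enough variety to sample from:

```python
import pandas as pd

df = pd.DataFrame({"A": range(10)})  # hypothetical frame for illustration

first = df.sample(n=3, random_state=2)
second = df.sample(n=3, random_state=2)

# Same seed, same rows in the same order.
same_rows = first.index.tolist() == second.index.tolist()

# Without a seed the rows generally differ from call to call.
unseeded = df.sample(n=3)
```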

## DataFrame Slicing 

Look at these two options: you can slice data frames by row and column labels with **.loc** and by integer position with **.iloc**.

Using both indexers, get a subset of the first two rows and the first three columns. Assign it to a variable called **"subset"**.

In [97]:
subset =  my_data_frame_3.iloc[0:2,0:3]
subset

   column A    column B  column C
0         1  2020-07-25     100.0
1         2  2020-07-25     100.0


In [100]:
subset =  my_data_frame_3.loc[0:1, "column A": "column C"]
subset

   column A    column B  column C
0         1  2020-07-25     100.0
1         2  2020-07-25     100.0


Can you explain why we have to use different row numbers to obtain an identical result with **.loc** and **.iloc**?

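**.loc** slices by label and includes both endpoints, whereas **.iloc** slices by integer position and, like ordinary Python slicing, excludes the end position.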
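
A minimal sketch of the endpoint behaviour behind that question, using a hypothetical frame `df` with string row labels so that labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical frame; labels and values are invented for illustration.
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [5, 6, 7, 8], "z": [9, 10, 11, 12]},
    index=["r0", "r1", "r2", "r3"],
)

by_position = df.iloc[0:2, 0:2]         # positions 0 and 1; the stop position is excluded
by_label = df.loc["r0":"r1", "x":"y"]   # labels r0 through r1; the stop label is included

identical = by_position.equals(by_label)
```

When the row labels happen to be the integers 0, 1, 2, ... the two indexers look alike, which is why the stop values differ by one in the cells above.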

Look up the documentation of the **rename()** method. Can you rename the columns to simply "A", "B", "C", "D", "E", "F"? Pay attention to what the **inplace** argument does.

In [105]:
my_data_frame_3.rename(inplace = True, columns={"column A": "A", "column B": "B", "column C": "C", "column D": "D", "column E": "E", "column F": "F"})

Check if it worked. What basic command can you use for that?

In [106]:
my_data_frame_3.head(0)

Empty DataFrame
Columns: [A, B, C, D, E, F]
Index: []


Nice! Now we can explore two ways of accessing a column of our data frame:

In [107]:
my_data_frame_3['A']

0    1
1    2
2    3
3    4
Name: A, dtype: int64

In [108]:
my_data_frame_3.A

0    1
1    2
2    3
3    4
Name: A, dtype: int64

Note that the second option only works when the column name is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute or method. Because of this, the first option is generally preferred.
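
The sketch below illustrates where attribute access falls short. The frame `df` and its column names are hypothetical, chosen to include a name with a space and a name that coincides with a DataFrame method.

```python
import pandas as pd

# Hypothetical columns: a valid identifier, a name with a space, a name that clashes with a method.
df = pd.DataFrame({"A": [1, 2], "column B": [3, 4], "count": [5, 6]})

via_attribute = df.A            # works: "A" is a valid Python identifier
via_brackets = df["column B"]   # a name with a space needs brackets

# df.count is the DataFrame.count method, not the column named "count".
count_is_method = callable(df.count)
count_column = df["count"]
```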

Finally, let's check what these methods do when applied to the data frame or its columns. Feel free to add as many cells as you need and play around with them. Do not worry if they do not make sense to you yet; we will discuss them during our session!

* .describe()
* .value_counts()
* .mean()
* .unique()

In [110]:
my_data_frame_3.describe()

          A      C     D
count  4.00    4.0   4.0
mean   2.50  100.0  22.0
std    1.29    0.0   0.0
min    1.00  100.0  22.0
25%    1.75  100.0  22.0
50%    2.50  100.0  22.0
75%    3.25  100.0  22.0
max    4.00  100.0  22.0


In [112]:
my_data_frame_3.A.value_counts()

4    1
3    1
2    1
1    1
Name: A, dtype: int64

In [93]:
my_data_frame_3.mean(numeric_only=True)  # average the numeric columns only

A      2.5
C    100.0
D     22.0
dtype: float64

In [113]:
my_data_frame_3.A.unique()

array([1, 2, 3, 4])
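
These methods become more informative on columns with repeated values. A minimal sketch, assuming a hypothetical Series `weather` whose values are invented for illustration:

```python
import pandas as pd

weather = pd.Series(["sun", "rain", "sun", "sun"], name="E")  # hypothetical values

counts = weather.value_counts()                # frequency of each distinct value, most frequent first
distinct = weather.unique()                    # distinct values in order of first appearance
share = weather.value_counts(normalize=True)   # relative frequencies instead of counts
```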

Hooray, you have taken a huge first step forward with **pandas**. More **pandas** and bigger datasets to come soon!