# Why do we need pandas?

**Pandas** is a package designed to work with data frames. Data frames are 2-dimensional data structures that are most commonly used in data analytics, especially when you are working with tabular data. Pandas enables you to do numerous types of data transformations:

* read data of various formats into Python: flat files (CSV and delimited), Excel files, databases, etc.
* clean your data: handle missing values (*NaN*s), convert data types from one to another, etc.
* change the dimensionality of your data: insert and delete columns and rows, etc.
* organize the data in the most efficient way for your analysis: manipulate the index column, label observations, etc.
* perform split-apply-combine operations on data frames: aggregate data, explore summary statistics on multiple levels, etc.
* convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects: convert arrays and dictionaries into data frames, etc.
* slice and subset data using labels and indexes
* merge and join multiple data frames

See the official documentation and description here: https://pandas.pydata.org/docs/pandas.pdf

In [None]:
#! pip install numpy
#! pip install pandas
import numpy as np #this alias is a convention
import pandas as pd #this alias is a convention

Modifying display options matters little when you work with relatively small data frames, but it is useful when working with larger ones. Options control the format of the output: you can manage how much detail descriptive functions and methods display. The settings below are just examples.

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 2)
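
If you want to check what an option is currently set to, or change it only for a moment, the following sketch may help. It assumes nothing beyond pandas itself; the value 5 is an arbitrary choice for illustration.

```python
import pandas as pd

# Read back the current value of an option.
before = pd.get_option('display.max_rows')

# Temporarily override an option; the previous value is restored on exit.
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')
print(before, inside, after)
```

Using a context manager like this leaves the notebook-wide settings untouched, which can be convenient when only one output needs a different format.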

## Series and DataFrames

Series are 1-dimensional structures, or, in essence, vectors of values of a certain type. They are the fundamental building blocks of a DataFrame.

Create an example Series object with pd.Series() and print it out.

In [None]:
my_Series = #YOUR CODE GOES HERE
my_Series

A DataFrame is a 2-dimensional structure: a collection of Series of the same length that describe features of a set of observations.
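
To make the relationship between the two structures concrete, here is a small sketch on made-up data. The names `height`, `weight` and `people` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: two features measured for the same three observations.
height = pd.Series([1.62, 1.75, 1.80], name='height_m')
weight = pd.Series([58.0, 72.5, 81.0], name='weight_kg')

# Combine the Series column-wise; they align on their shared index 0, 1, 2.
people = pd.concat([height, weight], axis=1)

# Pulling a single column back out returns a Series again.
col = people['height_m']
print(type(col).__name__)
print(people.shape)
```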

You have already figured out how to create an array of random numbers of a certain shape. Create one below.

In [None]:
array_of_random_numbers =  #YOUR CODE GOES HERE
array_of_random_numbers

Now, use the pd.DataFrame() function to convert this array to a data frame. See what the *columns* argument does and provide names for the columns of your data frame.

In [None]:
my_data_frame =  #YOUR CODE GOES HERE
my_data_frame

It is a common approach in pandas to create a data frame from a dictionary. You already know how to create a dictionary.

In [None]:
my_dictionary =  #YOUR CODE GOES HERE
my_dictionary

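Let's convert our dictionary to a data frame by passing it to the **pd.DataFrame()** function. If all the values in your dictionary are scalars (single values rather than lists), pandas cannot infer how many rows to build, so specify the index as an argument of the function, like so: **index=[0]**.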

In [None]:
my_data_frame_2 = pd.DataFrame(my_dictionary, index=[0])
my_data_frame_2
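
The role of the index argument is easier to see side by side. The sketch below uses two hypothetical dictionaries (`scalars` and `lists`), one holding single values and one holding lists.

```python
import pandas as pd

# Hypothetical dictionary whose values are all scalars.
scalars = {'name': 'Ada', 'age': 36}

# Without an index, pandas does not know how many rows to create.
raised = False
try:
    pd.DataFrame(scalars)
except ValueError as err:
    raised = True
    print(err)

# Passing an index with one label yields a one-row data frame.
one_row = pd.DataFrame(scalars, index=[0])

# Hypothetical dictionary whose values are lists: the row count is
# taken from the list length, so no index is required.
lists = {'name': ['Ada', 'Alan'], 'age': [36, 41]}
two_rows = pd.DataFrame(lists)

print(one_row.shape, two_rows.shape)
```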

Nice! Now, let's build on the idea of building data frames from a dictionary. One of the wonderful facts about data frames is that they welcome objects of all data types. This universality helps us work with diverse features all at once.

In [None]:
my_data_frame_3 = pd.DataFrame({
    'column A': [1, 2, 3, 4],
    'column B': pd.Timestamp('20200725'),
    'column C': pd.Series(100, index=list(range(4)), dtype='float32'),
    'column D': np.array([22] * 4, dtype='int32'),
    'column E': pd.Categorical(["sun", "rain", "sun", "rain"]),
    'column F': 'verified'
})
my_data_frame_3

Run the cell below to get information on my_data_frame_3. What can you infer about the data frame? What types of variables does it contain?

In [None]:
my_data_frame_3.info()

## Basic commands in pandas

Use the toy data frame **my_data_frame_3** created in the previous section. Add more cells to the notebook in this section and see what these methods and attributes tell you about the data frame.

* .head() - use n=2 as an argument 
* .tail() - use n=2 as an argument
* .shape
* .columns
* .values
* .dtypes

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

Examine what the **sample()** method does. Can you sample 2 random rows from our toy data frame?

In [None]:
 #YOUR CODE GOES HERE

What does random_state do? What happens when you use random_state and when you do not use it?

Type here

## DataFrame Slicing 

There are two options: you can slice data frames using row and column labels with **.loc**, or using integer positions with **.iloc**.
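
As a quick illustration of labels versus positions, here is a sketch on a hypothetical data frame called `prices`. Its rows carry string labels so that the two ways of indexing are easy to tell apart.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
prices = pd.DataFrame(
    {'price': [10, 20, 30], 'qty': [1, 2, 3]},
    index=['r1', 'r2', 'r3'],
)

# Label-based: row labelled 'r2', column labelled 'qty'.
by_label = prices.loc['r2', 'qty']

# Position-based: second row (position 1), second column (position 1).
by_position = prices.iloc[1, 1]

print(by_label, by_position)  # both select the same cell

# Lists of labels or positions select several rows and columns at once.
sub_label = prices.loc[['r1', 'r3'], ['price']]
sub_position = prices.iloc[[0, 2], [0]]
print(sub_label.equals(sub_position))
```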

Using both methods, get a subset of the first two rows and the first three columns. Assign it to a variable called **"subset"**.

In [None]:
subset =  #YOUR CODE GOES HERE
subset

In [None]:
subset =  #YOUR CODE GOES HERE
subset

Do you have an idea why we have to use different numbers for rows to obtain an identical result with **.loc** and **.iloc**?

Type here

Look up the documentation of the **rename()** method. Can you rename the columns to simply "A", "B", "C", "D", "E", "F"? Pay attention to what the **inplace** argument does.

In [None]:
 #YOUR CODE GOES HERE

Check if it worked. Which basic command can you use for that?

In [None]:
my_data_frame_3.head(2)

Nice, now we can explore two ways of accessing a column of our data frame:

In [None]:
my_data_frame_3['A']

In [None]:
my_data_frame_3.A

Note that the second option only works when the column name is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute or method. Because of this, the first option is generally preferred.
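
The sketch below shows where the two styles agree and where attribute access breaks down. The frame `df` and its column names are hypothetical. The `count` column is an extra case added here for illustration: it shares its name with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one name contains a space, one matches a method name.
df = pd.DataFrame({'A': [1, 2], 'column B': [3, 4], 'count': [5, 6]})

# For a simple name, both styles return the same Series.
a_attr = df.A
a_bracket = df['A']
print(a_attr.equals(a_bracket))

# A name with a space can only be reached with brackets;
# writing df.column B would be a syntax error.
b = df['column B']

# df.count refers to the DataFrame.count method, not to the column.
print(callable(df.count))

# Brackets always return the column itself.
c = df['count']
print(list(c))
```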

Finally, let's check what these methods do when applied to columns of the data frame. Feel free to add as many cells as you need and play around with them. Do not worry if they do not make sense to you - we will discuss them during our session!

* .describe()
* .value_counts()
* .mean()
* .unique()

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

In [None]:
 #YOUR CODE GOES HERE

Hooray, you have made a huge first step forward with **pandas**. More **pandas** and bigger datasets to come soon!