# Pandas Crash Course

Written by Mehdi Paydayesh on 25/04/2019


## What is Pandas?
Pandas is one of the most popular and widely used Python libraries in data science. Its name is often expanded as "Python Data Analysis Library", and it has been a game changer for data manipulation and analysis.
It is built on top of the NumPy package and lets Python read in datasets from various formats, such as CSV files, TSV files, or SQL databases. Its main data structure, the `DataFrame`, lets you read, store, select, and manipulate tabular data organized as rows of observations and columns of variables. It also lets you quickly grab statistics such as the mean value of a column.
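
As a minimal sketch of the workflow described above, the snippet below reads CSV data into a `DataFrame` and grabs the mean of one column. The CSV text is inlined through `io.StringIO` only to keep the example self-contained (an assumption for illustration); it mirrors the `salaries.csv` data used later in the notebook, and in practice you would pass a file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Inlined CSV text standing in for a file on disk (assumption for illustration).
csv_text = "Name,Salary,Age\nJohn,50000,34\nSally,120000,45\nAlyssa,80000,27\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Rows are observations, columns are variables.
print(df.shape)

# Quick statistic on a single column.
mean_age = df["Age"].mean()
print(mean_age)
```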

## Why is it important?
Pandas offers flexible data structures, easy syntax, and fast operations. In combination with libraries such as matplotlib for data visualization and NumPy for numerical and statistical operations, it provides a toolkit that is used extensively for scientific computing in Python.

## What will you see in this document?
I have put together a tutorial in the form of a notebook in my GitHub repository that covers the basics of Pandas: DataFrames and basic data manipulation, with code samples. It is aimed at those looking for a quick guide to the most common Pandas functions, and it gives you the essentials you need to kickstart your data analysis fun in Python!

## Requirements
The major libraries used:
1. Pandas
2. NumPy

## How to install 
The easiest way to get Pandas is to install it through the Anaconda distribution. Alternatively, you can install it with `pip install pandas` or `conda install pandas`.
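
After installing, one quick way to check that the libraries are importable is to print their versions. This is only a sanity-check sketch; the version strings you see depend on your environment.

```python
import numpy as np
import pandas as pd

# If either import fails, the installation step did not succeed.
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
```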


## File structure

The notebook file is called **PandasCrashCourse.ipynb** and includes the following sections:

**Part 0: Importing libraries**

**Part 1: Grabbing a specific column**

**Part 2: Grabbing multiple columns of data**

**Part 3: Using conditional filtering to select certain rows and columns**

**Part 4: Grabbing unique values**

**Part 5: Grabbing all the column names of the data frame**

**Part 6: Reporting back the information about the data frame**

**Part 7: Reporting the statistics in the data**

**Part 8: Reporting the range of the index**

**Part 9: Creating a Pandas data frame using some NumPy-generated numbers**
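
Parts 8 and 9 are listed above but are not spelled out in the cells that follow, so here is a minimal sketch of what they might look like. The shape, seed, and column names are assumptions chosen for illustration, not values taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the numbers are reproducible (seed value is arbitrary).
rng = np.random.default_rng(seed=0)

# 5 rows x 3 columns of standard-normal samples.
values = rng.standard_normal((5, 3))

# Wrap the NumPy array in a DataFrame with hypothetical column names.
df_random = pd.DataFrame(values, columns=["A", "B", "C"])
print(df_random.shape)

# Part 8: the default index is a RangeIndex covering the row positions.
print(df_random.index)
```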

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load a CSV file
df=pd.read_csv('salaries.csv')
print(df)

     Name  Salary  Age
0    John   50000   34
1   Sally  120000   45
2  Alyssa   80000   27


In [None]:
# save a CSV file
df.to_csv('path and file name where you want to save it', index=False)

In [3]:
# grabbing a specific column
a= df["Salary"]
print(a)

0     50000
1    120000
2     80000
Name: Salary, dtype: int64


In [5]:
# grabbing multiple columns of data
b=df[['Name','Salary']]
print(b)

     Name  Salary
0    John   50000
1   Sally  120000
2  Alyssa   80000


In [6]:
# min, max and mean operations
a= df["Salary"].min()
print(a)
b= df["Salary"].max()
print(b)
c= df["Salary"].mean()
print(c)

50000
120000
83333.33333333333


In [5]:
# Using conditional filtering to select certain rows and columns
a=df["Age"]>30
print(a)
print (df[a]) # filtering 
print (df[df["Age"]>30]) # doing the filtering in one-step
df[(df.Age > 30) & (df.Salary == 50000)]
df[(df.Age > 30) | (df.Salary == 50000)]
df[~(df.Age > 30) & (df.Salary == 50000)]


0     True
1     True
2    False
Name: Age, dtype: bool
    Name  Salary  Age
0   John   50000   34
1  Sally  120000   45
    Name  Salary  Age
0   John   50000   34
1  Sally  120000   45


Empty DataFrame
Columns: [Name, Salary, Age]
Index: []


In [6]:
### Other ways to select data
## iloc selects data by integer position (row/column number)
## loc selects data by label
# Single selections using iloc and DataFrame
# Rows:
print(df.iloc[0]) # first row of data frame 
print(df.iloc[1]) # second row of data frame
print(df.iloc[-1]) # last row of data frame
# Columns:
print(df.iloc[:,0]) # first column of data frame
print(df.iloc[:,1]) # second column of data frame 
print(df.iloc[:,-1]) # last column of data frame 

Name       John
Salary    50000
Age          34
Name: 0, dtype: object
Name       Sally
Salary    120000
Age           45
Name: 1, dtype: object
Name      Alyssa
Salary     80000
Age           27
Name: 2, dtype: object
0      John
1     Sally
2    Alyssa
Name: Name, dtype: object
0     50000
1    120000
2     80000
Name: Salary, dtype: int64
0    34
1    45
2    27
Name: Age, dtype: int64


In [None]:
# Note: `data` is a separate example DataFrame (indexed by last name) that is not defined in this notebook
# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']
# Select all rows from 'Andrade' through 'Veness' (label slices include both ends),
# with just the 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]

# Change the index to be based on the 'id' column
data.set_index('id', inplace=True)
# Select the row with 'id' = 487
data.loc[487]
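
Since the `data` frame in the cell above is not defined in this notebook, here is a self-contained sketch of the same `loc` ideas using the salaries data from the earlier cells. Using `Name` as the index is an assumption made only so that there are labels to select by.

```python
import pandas as pd

# Same values as the salaries.csv output shown earlier.
df = pd.DataFrame({"Name": ["John", "Sally", "Alyssa"],
                   "Salary": [50000, 120000, 80000],
                   "Age": [34, 45, 27]})

# Use the Name column as row labels.
by_name = df.set_index("Name")

# loc selects by label: a single cell, then a list of rows and columns.
sally_salary = by_name.loc["Sally", "Salary"]
subset = by_name.loc[["John", "Alyssa"], ["Age"]]

# Label slices include both endpoints, unlike positional slices.
sliced = by_name.loc["John":"Sally"]

# iloc still selects by position, regardless of the labels.
first_by_position = by_name.iloc[0]

print(sally_salary)
print(subset)
print(sliced)
```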

In [8]:
# grabbing unique values
a=df["Age"].unique()
# grabbing the number of unique values
b=df["Age"].nunique()
print (a)
print (b)

[34 45 27]
3


In [9]:
# grabbing all the column names of the data frame
a=df.columns
print (a)

Index(['Name', 'Salary', 'Age'], dtype='object')


In [10]:
# reporting back the information about the data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Salary  3 non-null      int64 
 2   Age     3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes


In [11]:
# reporting the statistics in the data
df.describe()

              Salary        Age
count       3.000000   3.000000
mean    83333.333333  35.333333
std     35118.845843   9.073772
min     50000.000000  27.000000
25%     65000.000000  30.500000
50%     80000.000000  34.000000
75%    100000.000000  39.500000
max    120000.000000  45.000000


In [12]:
# Filtering values

In [15]:
df[df.Name=='John']

   Name  Salary  Age
0  John   50000   34


In [18]:
df[df['Salary']==80000]

     Name  Salary  Age
2  Alyssa   80000   27


In [13]:
## apply, agg, transform
# DataFrame.agg(): performs aggregation operations only (e.g. sum, min)
# DataFrame.apply(): applies a function along an axis of the DataFrame
# DataFrame.transform(): applies a function to each Series of the DataFrame
# and returns a result with the same shape as the original DataFrame.
df2 = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                columns=['A', 'B', 'C'])

df2.agg(['sum', 'min'])

        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0


In [None]:
# Apply specific aggregations to specific columns of the DataFrame
df2.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})

# apply: sum across each row (axis=1) of the numeric frame df2
df2.apply(np.sum, axis=1)

# apply with a custom function on each group
# (new names df3/df4 are used so the salaries frame `df` is not overwritten)
df3 = pd.DataFrame({'M': list('ABCBA'), 'N': list('XYZXY'),
                    'J': [1, 4, 6, 7, 2], 'K': [5, 3, 3, 2, 6]})

def func(group):
    # difference between columns J and K within each group
    return group['J'] - group['K']

grouped = df3.groupby('M')[['J', 'K']].apply(func)
print(grouped)

# transform: the result has the same shape as the input
df4 = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
df4.transform(lambda x: x + 1)

# transform with multiple functions: one output column per (column, function) pair
trans = df4.transform([np.sqrt, np.exp])

In [None]:
### TODO: add append, merge, join

# Reference: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
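
The note above mentions append, merge, and join without showing them, so here is a small sketch of each. Splitting the salaries data into two frames and adding an extra row are assumptions made for illustration (the "Maria" row is hypothetical). `pd.concat` is used for appending rows because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# The salaries data split into two frames that share the Name column.
salaries = pd.DataFrame({"Name": ["John", "Sally", "Alyssa"],
                         "Salary": [50000, 120000, 80000]})
ages = pd.DataFrame({"Name": ["John", "Sally", "Alyssa"],
                     "Age": [34, 45, 27]})

# Hypothetical extra row, only to demonstrate appending.
new_rows = pd.DataFrame({"Name": ["Maria"], "Salary": [95000]})

# Append rows: stack frames vertically and rebuild the index.
appended = pd.concat([salaries, new_rows], ignore_index=True)

# Merge: SQL-style join on a shared column.
merged = salaries.merge(ages, on="Name", how="inner")

# Join: combine on the index, so Name becomes the index first.
joined = salaries.set_index("Name").join(ages.set_index("Name"))

print(appended)
print(merged)
print(joined)
```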

In [None]:
# df2[df2['E'].isin(['two','four'])]
# filter a DataFrame on several conditions
## Delete column D
del df2['D']
# df['new col'] = values
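
The commented-out lines in the cell above hint at `isin`, multi-condition filtering, deleting a column, and adding one. A runnable sketch of those four steps follows; the frame and its values are hypothetical, chosen only so that columns `D` and `E` exist.

```python
import pandas as pd

# Hypothetical frame with the D and E columns the cell above refers to.
df2 = pd.DataFrame({"A": [1, 2, 3, 4],
                    "D": [10, 20, 30, 40],
                    "E": ["one", "two", "three", "four"]})

# Keep rows whose E value is in the given list.
picked = df2[df2["E"].isin(["two", "four"])]

# Combine several conditions with & (and) or | (or); each needs parentheses.
both = df2[(df2["A"] > 2) & (df2["E"].isin(["two", "four"]))]

# Delete column D in place.
del df2["D"]

# Add a new column by assigning values to a new name.
df2["G"] = df2["A"] * 2

print(picked)
print(both)
print(df2)
```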

In [19]:
# Create an empty DataFrame
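
The cell above contains only a comment, so as a sketch, these are two common ways to create an empty DataFrame. Reusing the salaries column names for the second one is an assumption for illustration.

```python
import pandas as pd

# Completely empty: no rows and no columns.
empty = pd.DataFrame()
print(empty.empty)

# Empty but with predefined columns, ready to be filled later.
empty_with_cols = pd.DataFrame(columns=["Name", "Salary", "Age"])
print(list(empty_with_cols.columns))
print(len(empty_with_cols))
```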

In [20]:
# Create the DataFrame from values of different types (float and int scalars here)
# An explicit index is required when every value is a scalar
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : 2,
                     'C' : 2,
                     'D' : 2,
                     'E' : 2,
                     'F' : 2 },
                   index=[0])

print("Displaying the DataFrame")
print(df2)

In [21]:
df.head()

     Name  Salary  Age
0    John   50000   34
1   Sally  120000   45
2  Alyssa   80000   27


In [22]:
df.tail()

     Name  Salary  Age
0    John   50000   34
1   Sally  120000   45
2  Alyssa   80000   27


In [8]:
## Convert to JSON
# We convert our DataFrame to JSON using to_json, which accepts arguments such as:
# orient, which specifies what the key/value pairs should be. The default is columns, so each column name is a key and each column is the value.
# date_format, which specifies the date format. The default is epoch.
dfjson = df.to_json()
print(dfjson)

{"Name":{"0":"John","1":"Sally","2":"Alyssa"},"Salary":{"0":50000,"1":120000,"2":80000},"Age":{"0":34,"1":45,"2":27}}
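
To illustrate the `orient` argument, the sketch below converts the same salaries data with `orient="records"`, which produces one JSON object per row instead of one per column. The frame is rebuilt inline only to keep the example self-contained.

```python
import json

import pandas as pd

# Same values as the salaries frame used throughout the notebook.
df = pd.DataFrame({"Name": ["John", "Sally", "Alyssa"],
                   "Salary": [50000, 120000, 80000],
                   "Age": [34, 45, 27]})

# orient="records" yields a JSON list with one object per row.
records_json = df.to_json(orient="records")
print(records_json)

# Parse it back with the standard library to inspect the structure.
records = json.loads(records_json)
print(records[0])
```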
