## Pandas Data Frames

- What Is Pandas?
- Pandas vs NumPy
- Pandas Data Frame Intro
- Pandas Data Frame fundamental operations
    - Creating
    - Selecting/indexing
    - Inserting rows/columns
    - Setting data
    - Filtering
    - Dropping rows/columns
- Dealing with Missing values

%matplotlib inline
import numpy as np
import pandas as pd
from IPython.display import Image
from IPython.display import HTML
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


CSS = """
.output {
    align-items: center;
}
"""
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Code mode"></form>''')

In [146]:
from IPython.display import display,Image, HTML

CSS = """
.output {
    align-items: center;
}
div.output_area {
    width: 80%;
}
"""
HTML('<style>{}</style>'.format(CSS))

# What is Pandas?

### - Enables working with tabular and labeled data easily and intuitively
### - Pandas is an open-source library built on top of the NumPy package.
- https://github.com/pandas-dev/pandas
- https://github.com/pandas-dev/pandas/blob/059c8bac51e47d6eaaa3e36d6a293a22312925e6/pandas/core/frame.py

### - Pandas data structures are:
    - Series
    - Index
    - Data Frame
    

## Quick refresher on NumPy arrays..
- contain numerical ***homogeneous*** data (every element has the same data type)
- may have one or more dimensions
- are used for numerical computation on single- and multi-dimensional data

In [147]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

two_dim_arr = np.random.randint(10, size=(3, 4))  # Two-dimensional array
three_dim_arr = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array


### A two-dimensional array example...

In [148]:
print("Two-Dimensional Array")
two_dim_arr

Two-Dimensional Array


array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

### What I mean by Homogeneous...

In [149]:
print(two_dim_arr)

[[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]]


**two_dim_arr[0,0] = "Hello"**

In [150]:
#printmd("***Oops....***")
two_dim_arr[0,0] = "Hello" 

ValueError: invalid literal for int() with base 10: 'Hello'
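
The homogeneity rule can be checked through the array's `dtype`. Here is a minimal sketch, using a small hypothetical array `arr` (not the random one above) and catching the error instead of letting it stop the notebook:

```python
import numpy as np

# Hypothetical example array; every element shares one integer dtype.
arr = np.array([[5, 0, 3], [7, 9, 3]])
print(arr.dtype)

# Assigning a non-numeric string fails because it cannot be cast to an integer.
try:
    arr[0, 0] = "Hello"
    assignment_failed = False
except ValueError as err:
    assignment_failed = True
    print("ValueError:", err)

# The array is left unchanged after the failed assignment.
print(arr[0, 0])
```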

### You can directly and simply form the DataFrame from the 2D array

In [None]:
import pandas as pd

In [None]:

print("Data Frame formed by 2D Array")
#printmd("***df=pd.DataFrame(two_dim_arr)***")

df=pd.DataFrame(two_dim_arr)
df

### Pandas Data Frame is Heterogeneous!
**df.iloc[0,0]="Hello"**

In [None]:
# .iloc[row, column] selects a cell by position;
# here the cell at row 0, column 0 is replaced with "Hello"

df.iloc[0,0]="Hello"
df
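
Another way to see the heterogeneity is to look at the per-column data types. The sketch below assumes a small made-up frame, `mixed_df`, whose column names are purely illustrative:

```python
import pandas as pd

# Hypothetical frame: each column holds its own data type.
mixed_df = pd.DataFrame({
    "count": [1, 2, 3],          # integers
    "price": [0.5, 1.25, 2.0],   # floats
    "label": ["a", "b", "c"],    # strings
})

# One dtype per column, unlike a NumPy array's single dtype.
print(mixed_df.dtypes)
```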

### Pandas Data Frame labels the data with Indices and Columns labels
pd.DataFrame(np.random.randint(10,size=(3,2)),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

In [None]:
np.random.randint(10,size=(3,2))

In [None]:
# np.random.seed(0)
# Give the rows (index) and the columns explicit labels
foo_df=pd.DataFrame(np.random.randint(10,size=(3,2)),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c']
                   )

In [None]:
foo_df

### A Pandas DataFrame suits statistical observations (data points) described by several variables (numerical, categorical, etc.)

In [None]:
Image("res/Tidy_census.png")

### It is intuitive...  Look how convenient it is!!

In [None]:
people_df= pd.read_csv("data/people.csv")
people_df

In [None]:
# The image is in a folder called res, located in the same folder as this notebook,
# so a short relative path is enough

Image('res/excel-to-pandas.png') 

source: https://jalammar.github.io/

### Describing the Data Frame...
- df.info()
- df.count()
- df.describe()
- df.mean()

In [None]:
people_df

In [None]:
# in info(), string columns typically show up with dtype object

people_df.info()

In [None]:
# 25% is the first quartile, 50% equals the median, and 75% is the third quartile
people_df.describe()

In [None]:
#shift + tab to access the function documentation
people_df.count()

In [None]:
x=[1,3,4,50]
print(np.median(x))
print(np.mean(x))

In [None]:
#print(df,"\n")
people_df.info()
#print("df.count()  \n",df.count())
#print("\n df.describe() \n",df.describe())
#print("\n df.mean() \n",df.mean())
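
Since `data/people.csv` is not reproduced here, the four describing calls can be tried on a small stand-in. This is only a sketch: `demo_df` and its values are made up, and `numeric_only=True` is passed to `mean()` so the string column is skipped.

```python
import pandas as pd

# Hypothetical data standing in for data/people.csv.
demo_df = pd.DataFrame({
    "name": ["Pol", "Javi", "Maria", "Anna"],
    "age": [22, 20, 23, None],   # one missing value
})

demo_df.info()                          # column dtypes and non-null counts
print(demo_df.count())                  # non-null values per column
print(demo_df.describe())               # summary statistics for numeric columns
print(demo_df.mean(numeric_only=True))  # mean of the numeric columns only
```

Note how `count()` reports one value fewer for `age` than for `name`, because missing values are not counted.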

### Pandas Data Frame operations

In [None]:
Image("res/CRUD.png")

### Data Frame creation
You can create/form a Data Frame from:
- Dict of 1D ndarrays, lists, dicts, or Series

- 2-D numpy.ndarray

- Structured or record ndarray

- A Series

- Another DataFrame
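
A quick sketch of several of these sources side by side. All variable names, column names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# From a dict of 1-D ndarrays: the keys become the column names.
from_dict = pd.DataFrame({"col1": np.array([1.0, 2.0]), "col2": np.array([4.0, 3.0])})

# From a 2-D ndarray: column names are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From a structured ndarray: the field names become the column names.
records = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("value", "f8")])
from_records = pd.DataFrame(records)

# From a single Series: the Series name becomes the column name.
from_single = pd.DataFrame(pd.Series([1, 2, 3], name="z"))

# From another DataFrame.
from_frame = pd.DataFrame(from_array)

for frame in (from_dict, from_array, from_records, from_single, from_frame):
    print(frame.shape, list(frame.columns))
```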

#### Here is an example...

In [None]:
print('dic = {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [4.0, 3.0, 2.0, 1.0]}\n')

dic = {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [4.0, 3.0, 2.0, 1.0]}

dic

In [None]:
df=pd.DataFrame(dic)
df # display the DataFrame

#### creating Index for the Data frame...

In [None]:
df=pd.DataFrame(dic,index=["a", "b", "c", "d"]) # naming the rows explicitly is less common
df # df is the conventional variable name for a DataFrame

In [None]:
np.array(df["col1"])

### Creating Data frame from Pandas Series objects.. 

In [None]:
d = {
       "apples": [3, 2, 0,1],
        "oranges": [0, 3, 7, 2],
    }

pd.DataFrame(d)

In [None]:
pd.DataFrame(d).apples # selecting only the apples column

In [None]:
type(pd.DataFrame(d).apples) # each column is a pandas Series
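
The relationship also works in the other direction: a Data Frame can be assembled from explicit Series objects. A minimal sketch, where the row labels (customer names) are hypothetical:

```python
import pandas as pd

# Two Series sharing the same (hypothetical) customer labels.
customers = ["June", "Robert", "Lily", "David"]
apples = pd.Series([3, 2, 0, 1], index=customers)
oranges = pd.Series([0, 3, 7, 2], index=customers)

# Each dict key becomes a column; the shared index becomes the row labels.
purchases = pd.DataFrame({"apples": apples, "oranges": oranges})
print(purchases)

# Selecting one column returns a Series again.
column = purchases["apples"]
print(type(column))
```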

In [None]:
Image("res/series-and-dataframe.width-1200.png")

source: https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png

### Data Frame Selection / Indexing

In [None]:
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

students_df = pd.DataFrame(data=data, index=row_labels)
# data supplies the columns (the dict keys become column names); index supplies the row labels

students_df

In [None]:
students_df.index

Source: https://realpython.com/

### data Selection

In [None]:
students_df.loc[[101],["age"]] 

In [None]:
students_df

In [None]:
students_df.loc[[101],:] # selecting all the columns of row 101

In [None]:
# a pandas Series is different from a pandas DataFrame; not every method works on both


In [None]:
students_df[["age"]] # double brackets return a one-column DataFrame, which is often more convenient to manipulate

### Selecting by Label
- .loc[]  function

In [None]:
#uses labels to locate

In [None]:
#print("students_df.loc[:, 'city']")
#students_df.loc[:, 'city']

In [None]:
print("students_df.loc[102, :]")
students_df.loc[102,:]

In [None]:
#print('df["city"]')
cities = students_df[["age","city"]]
cities

In [None]:
print("df.city")
students_df.city

### Selecting by Position
- .iloc[]

In [None]:
# uses integer positions to locate; used less often than .loc

In [None]:
students_df

In [None]:
students_df.iloc[0,1] # selecting by integer position, as in NumPy indexing

In [None]:
print("students_df.iloc[1:6, [0, 1]]")
students_df.iloc[1:6, [0, 1]] 

In [None]:
students_df.iloc[:, [3]] # selecting only the last column (position 3)

In [None]:
students_df.iloc[:, [-1]]

### Hmm.. Can you tell what is the difference between loc and iloc?

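loc selects by label, while iloc selects by integer position. A label slice such as 102:106 includes both endpoints, whereas a position slice such as 1:6 excludes the end.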
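
The difference is easiest to see on a frame whose row labels are not 0, 1, 2. A small sketch with a made-up three-row frame named `demo`:

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions.
demo = pd.DataFrame(
    {"name": ["Xavier", "Ann", "Jana"], "age": [41, 28, 33]},
    index=[101, 102, 103],
)

by_label = demo.loc[101:102, "name"]   # label slice: both endpoints included
by_position = demo.iloc[0:2, 0]        # position slice: the end is excluded
print(list(by_label), list(by_position))

# Label 0 does not exist, so .loc raises KeyError, while .iloc[0] works.
try:
    demo.loc[0]
except KeyError:
    print("No row is labelled 0")
print(demo.iloc[0]["name"])
```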

### Setting/ Updating data

#### let us first update the Data frame index..

In [None]:
students_df

In [None]:
students_df.index = list(np.arange(0, 7))
students_df

In [None]:
students_df=students_df.reset_index()

In [None]:
students_df.index

In [None]:
students_df.iloc[:4, 3] # rows at positions 0 to 3, column at position 3 (age)

In [None]:
students_df.iloc[:4, 3] = [40, 50, 60, 70] # update the values at [:4, 3] to [40, 50, 60, 70]
students_df

In [None]:
students_df.loc[5, 'py-score'] 
students_df

In [None]:
students_df.loc[5, 'py-score'] = 70 # .loc locates by label: the py-score in row 5 (Amal) is updated
students_df

In [None]:
students_df.loc[:,"py-score"]=90 # sets every value in the py-score column to 90
students_df

In [None]:
students_df.iloc[:, -1] = [88.0, 90, 81.0, 80.0, 68.0, 61.0, 84.0]
students_df

In [None]:
students_df["py-score"]="Adam-score"
students_df

In [None]:
students_df

In [None]:
students_df["py-score"]=[88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
students_df

In [None]:
students_df.info()

In [None]:
students_df["py-score"]=students_df["py-score"]  * 1.5  # multiply every py-score by 1.5

In [None]:
students_df

In [None]:
students_df["py-score"]=list(map(lambda x: x+10,students_df["py-score"]))

In [None]:
students_df["py-score"] +=  10
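
The two cells above do the same thing in different ways. A short sketch on a stand-alone Series (the values are illustrative) suggests that the element-by-element `map` and the vectorized `+` give identical results, with the vectorized form being the shorter one:

```python
import pandas as pd

scores = pd.Series([88.0, 79.0, 81.0])

via_map = pd.Series(list(map(lambda x: x + 10, scores)))  # element by element in Python
via_vector = scores + 10                                   # one vectorized operation

print(via_map.equals(via_vector))
```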


In [None]:
students_df

In [None]:
students_df["py-score"]= students_df["py-score"] + students_df["age"]

In [None]:
students_df

In [None]:
students_df.columns = ['index', 'name', 'city', 'age', 'py-score']

In [None]:
students_df

In [None]:
students_df[["age","py-score"]]

In [None]:
type(students_df[["age"]])

### Inserting/deleting rows

In [None]:
students_df.columns


In [None]:
students_df  = students_df[['name', 'city', 'age', 'py-score']]
students_df

In [None]:
Ronald = pd.Series(data=['Ronald', 'Berlin', 34, 79],
                 index=students_df.columns[0:4])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
students_df = pd.concat([students_df, pd.DataFrame([Ronald])], ignore_index=True)
students_df

In [None]:
students_df.iloc[7,:]

In [None]:
Ronald = pd.Series(data=['Ronald', 'Berlin', 34, 79],
                 index=students_df.columns[0:4],name=21)

In [None]:
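# DataFrame.append was removed in pandas 2.0; with pd.concat the new row keeps the Series name (21) as its label
students_df = pd.concat([students_df, pd.DataFrame([Ronald])])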


In [None]:
students_df

In [None]:
students_df.drop(labels=[7, 21],inplace=True) # remove both Ronald rows so that the seven original students remain

In [None]:
students_df

### Inserting/Deleting columns

In [None]:
#print('df[js-score] = np.array([71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0])')
# creating a new column: assign values to a column name that does not exist yet

students_df['js-score'] =[71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0]
students_df

In [None]:
students_df['py-score-updated'] = students_df['py-score'] * 10 

In [None]:
students_df

### Inserting in a specific location

In [None]:
#print('df.insert(loc=4, column=js-score,value=np.array([86.0, 81.0, 78.0, 88.0, 74.0, 70.0, 81.0]))')
# to insert a column at a specific location, use the insert method
# e.g. loc=4 places the new column at zero-based position 4 (the fifth column)
students_df.insert(loc=4, column='django-score',
          value=np.array([70, 74, 78, 56, 66, 78, 81.0]))
students_df
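
To make the effect of `loc` visible, here is a minimal sketch on a made-up frame with single-letter column names. Plain assignment appends at the end, while `insert` places the column at the requested zero-based position and modifies the frame in place:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

demo["d"] = [7, 8]                             # appended as the last column
demo.insert(loc=1, column="x", value=[0, 0])   # placed at position 1, in place

print(list(demo.columns))
```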

### dropping specific column

In [None]:
# deleting
# axis=0 drops rows; axis=1 drops columns

students_df = students_df.drop(labels=['django-score'], axis=1)

In [None]:
students_df
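
A small sketch of `drop` along both axes, on a hypothetical frame `demo`. By default `drop` returns a new frame and leaves the original untouched, which is why the cell above assigns the result back:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 20, 30])

without_row = demo.drop(labels=[20], axis=0)   # new frame without row label 20
without_col = demo.drop(columns=["b"])         # same as labels=["b"], axis=1

# demo itself is unchanged because inplace=True was not used.
print(demo.shape, without_row.shape, without_col.shape)
```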

### Filtering/Boolean Indexing

In [None]:
# Boolean mask: one True/False value per row, True where py-score >= 200
students_df["py-score"] >= 200

In [None]:
# dataframe[condition] keeps only the rows where the condition is True
students_df[students_df["py-score"]>=200]
# the result is empty when no row has a py-score of at least 200

In [None]:
# filtering
students_df[students_df["name"]=="Jana"]

# written as an assignment, as below, the filtered result would overwrite students_df,
# leaving only the rows where name is "Jana"
# students_df = students_df[students_df["name"]=="Jana"]

In [None]:
very_good_students_filter = students_df['py-score'] >= 200
very_good_students_filter

In [None]:
students_df[very_good_students_filter]

### Creating powerful filters with logical operators AND (&), OR (|), NOT (~), XOR (^)

In [None]:
#print('df[(df[py-score] >= 80) & (df[js-score] >= 80)]')
students_df[(students_df['py-score'] >= 200) | (students_df['js-score'] >= 80)]
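
All four operators can be tried on a small stand-in frame. The scores and the threshold of 80 below are illustrative only:

```python
import pandas as pd

scores = pd.DataFrame({
    "py-score": [88.0, 79.0, 61.0, 84.0],
    "js-score": [85.0, 95.0, 70.0, 60.0],
})

py_ok = scores["py-score"] >= 80
js_ok = scores["js-score"] >= 80

both = scores[py_ok & js_ok]          # AND: both conditions hold
either = scores[py_ok | js_ok]        # OR: at least one holds
not_py = scores[~py_ok]               # NOT: the condition does not hold
exactly_one = scores[py_ok ^ js_ok]   # XOR: exactly one holds

print(list(both.index), list(either.index), list(not_py.index), list(exactly_one.index))
```

When the conditions are written inline, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>=`.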

## Using the value_counts function

In [None]:
people_df=pd.read_csv("data/people.csv")
people_df

In [None]:
# frequency counts; used very often

people_df["country"].value_counts()
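
A self-contained sketch of `value_counts`, using a stand-alone Series of country codes in place of the CSV column. Passing `normalize=True` returns shares instead of absolute counts:

```python
import pandas as pd

# Stand-in for the country column of people_df.
countries = pd.Series(["ES", "ES", "AR", "FR", "UK", "MA", "XX"], name="country")

counts = countries.value_counts()                 # absolute frequencies, most common first
shares = countries.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```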

In [152]:
#define a function
def update_age(age):
    return age + 10

In [153]:
# pandas makes it possible to apply a function to a column of the DataFrame
# apply alone does not change the original DataFrame; the result has to be assigned back to the column
people_df.age.apply(update_age)

0    32
1    30
2    33
3    34
4    34
5    40
6    12
Name: age, dtype: int64

In [158]:
# assigning the result back replaces the values

people_df["age"]= people_df.age.apply(update_age)
people_df

Unnamed: 0,name,age,country
0,Pol,42,ES
1,Javi,40,ES
2,Maria,43,AR
3,Anna,44,FR
4,Anna,44,UK
5,Javi,50,MA
6,Dog,22,XX


In [161]:
def check_name(name):
    if name.startswith("A"):
        return "Auto"
    else:
        return "Manual"
    

In [162]:
people_df["name"]= people_df.name.apply(check_name)
people_df

Unnamed: 0,name,age,country
0,Manual,42,ES
1,Manual,40,ES
2,Manual,43,AR
3,Auto,44,FR
4,Auto,44,UK
5,Manual,50,MA
6,Manual,22,XX


In [163]:
people_df["age"]= people_df.age.apply(lambda x: x+10)  # a lambda is a quick way to write a function inline; x is just the input
people_df

Unnamed: 0,name,age,country
0,Manual,52,ES
1,Manual,50,ES
2,Manual,53,AR
3,Auto,54,FR
4,Auto,54,UK
5,Manual,60,MA
6,Manual,32,XX


In [164]:
# a lambda can also contain a condition; x is just the input
# unlike a named function, a lambda is not reusable,
# so it suits logic that is needed only once or a few times

people_df["age"]= people_df.age.apply(lambda x: x+10 if x >20 else x+15)  
people_df

Unnamed: 0,name,age,country
0,Manual,62,ES
1,Manual,60,ES
2,Manual,63,AR
3,Auto,64,FR
4,Auto,64,UK
5,Manual,70,MA
6,Manual,42,XX


In [165]:
def update_age(x):
    if x > 20:
        return x+10
    else:
        return x+15
    
    # same logic as the lambda above, written as a named function

In [167]:
clara=19
update_age(clara) # works for any numeric input; 19 is not above 20, so 15 is added

34
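
For a simple condition like this one, a vectorized alternative to `apply` is `np.where`. This is a sketch of a different technique, not what the cells above use, and the ages are made up:

```python
import numpy as np
import pandas as pd

ages = pd.Series([19, 20, 21, 42])

# Row-by-row version, same rule as update_age above.
via_apply = ages.apply(lambda x: x + 10 if x > 20 else x + 15)

# Vectorized version: pick ages + 10 where the condition holds, ages + 15 elsewhere.
via_where = pd.Series(np.where(ages > 20, ages + 10, ages + 15), index=ages.index)

print(list(via_apply))
print(list(via_where))
```

Both approaches give the same values; the vectorized one avoids calling a Python function once per row.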