<a href="https://colab.research.google.com/github/husseintarhini/INDE431/blob/main/Introduction_to_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas

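The pandas package is one of the most widely used tools for data scientists and analysts working in Python. Its central object is the DataFrame, a labeled two-dimensional table that you can load, inspect, filter, and transform.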

In [None]:
import pandas as pd
import numpy as np

In [None]:
data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}

You can build a dataframe from a dictionary: each key becomes a column name and each list holds that column's values.

In [None]:
purchases = pd.DataFrame(data)  # create a dataframe from the data dictionary

purchases

Unnamed: 0,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2
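As a minimal sketch of what the constructor does with the dictionary, the block below checks the resulting shape and column names. The `records` list is a hypothetical alternative layout (one dictionary per row) added here only for comparison; it is not part of the original notebook.

```python
import pandas as pd

# Same data as above: each key becomes a column, each list holds that column's values.
data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
purchases = pd.DataFrame(data)

# Basic structure of the resulting dataframe.
n_rows, n_cols = purchases.shape          # 4 rows, 2 columns
column_names = list(purchases.columns)    # ['apples', 'oranges']

# Hypothetical alternative layout: one dictionary per row ("records").
records = [
    {"apples": 3, "oranges": 0},
    {"apples": 2, "oranges": 3},
    {"apples": 0, "oranges": 7},
    {"apples": 1, "oranges": 2},
]
purchases_from_records = pd.DataFrame(records)

# Both layouts describe the same table.
print(purchases.equals(purchases_from_records))
```

Which layout to use mostly depends on how the data arrives: column-oriented dictionaries are convenient for hand-typed examples, while row-oriented records are common when reading JSON-like data.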


The dataframe above uses the default integer index (0, 1, 2, 3), but we can also supply our own index when we create the DataFrame.

Let's have customer names as our index:

In [None]:
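# add a column holding the total number of fruits in each row
# (this column is discarded when purchases is re-created in the next cell)
purchases["total"] = purchases["apples"] + purchases["oranges"]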

In [None]:
purchases = pd.DataFrame(data, index=["June", "Robert", "Lily", "David"])

purchases

Unnamed: 0,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2
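If the names already live in a column rather than in a separate list, the same result can be reached after construction. The sketch below assumes a hypothetical `customer` column, which does not exist in the original data, and uses `set_index` and `reset_index` to move it in and out of the index.

```python
import pandas as pd

data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
purchases = pd.DataFrame(data)

# Hypothetical column of customer names, added only for this illustration.
purchases["customer"] = ["June", "Robert", "Lily", "David"]

# Promote the column to the index; this returns a new dataframe.
purchases_by_name = purchases.set_index("customer")

# reset_index moves the index back into an ordinary column.
restored = purchases_by_name.reset_index()

print(purchases_by_name)
```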


We can now look up a customer's order by name using `.loc`:

In [None]:
purchases.loc["June"]

apples     3
oranges    0
Name: June, dtype: int64

You can also locate a customer's order by its position in the dataframe using `.iloc`:

In [None]:
purchases.iloc[0]

apples     3
oranges    0
Name: June, dtype: int64
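The two accessors differ in more than the kind of key they take. The sketch below, which rebuilds the same `purchases` dataframe so it runs on its own, shows single-cell access and one detail that often surprises newcomers: label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# Single cell: row label + column label, or row position + column position.
lily_oranges = purchases.loc["Lily", "oranges"]
same_cell = purchases.iloc[2, 1]

# Label slice: BOTH endpoints are included (June, Robert, Lily).
by_label = purchases.loc["June":"Lily"]

# Position slice: the end position is excluded (rows 0 and 1 only).
by_position = purchases.iloc[0:2]

print(lily_oranges, same_cell)
print(len(by_label), len(by_position))
```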

### Importing a CSV file

In [None]:
california_housing = pd.read_csv("sample_data/california_housing_train.csv")
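The `sample_data/` path above exists by default in Google Colab. To try `read_csv` elsewhere, a minimal sketch is to read from an in-memory string instead of a file. The `csv_text` below is a small hand-made excerpt with only four of the dataset's columns, and `usecols` and `nrows` are optional `read_csv` parameters shown here as one way to load only part of a file.

```python
import io

import pandas as pd

# Hand-made excerpt standing in for the real file (assumption: same column names).
csv_text = """longitude,latitude,housing_median_age,median_house_value
-114.31,34.19,15.0,66900.0
-114.47,34.4,19.0,80100.0
-114.56,33.69,17.0,85700.0
"""

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# Load only some columns and only the first rows.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["longitude", "latitude"],
    nrows=2,
)

print(sample.shape, subset.shape)
```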

In [None]:
california_housing.head(5)  # to see the first 5 rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [None]:
california_housing.tail(5)  # to see the last 5 rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0
16999,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0


In [None]:
california_housing.describe()  # summary statistics for each numeric column

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [None]:
california_housing.info()  # get info about your data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


### Filtering in pandas

In [None]:
california_housing[california_housing["housing_median_age"] < 10]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
58,-115.52,32.67,6.0,2804.0,581.0,2807.0,594.0,2.0625,67700.0
75,-115.55,32.78,5.0,2652.0,606.0,1767.0,536.0,2.8025,84300.0
95,-115.58,32.81,5.0,805.0,143.0,458.0,143.0,4.4750,96300.0
98,-115.58,32.78,5.0,2494.0,414.0,1416.0,421.0,5.7843,110100.0
100,-115.59,32.79,8.0,2183.0,307.0,1000.0,287.0,6.3814,159900.0
...,...,...,...,...,...,...,...,...,...
16679,-122.79,38.48,7.0,6837.0,1417.0,3468.0,1405.0,3.1662,191000.0
16680,-122.79,38.42,9.0,4967.0,885.0,2581.0,915.0,5.0380,185600.0
16692,-122.82,38.55,8.0,6190.0,1088.0,2967.0,1000.0,3.8616,195100.0
16776,-123.00,38.33,8.0,3223.0,637.0,851.0,418.0,5.6445,364800.0
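The expression inside the brackets is a boolean Series (a "mask") with one True/False value per row, and only the True rows are kept. Masks can be combined with `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`. The sketch below uses a small hand-built `housing` dataframe, not the full dataset, so it runs on its own.

```python
import pandas as pd

# Small stand-in for california_housing (assumption: same column names).
housing = pd.DataFrame(
    {
        "housing_median_age": [6.0, 5.0, 15.0, 8.0, 52.0],
        "median_income": [2.0625, 2.8025, 1.4936, 6.3814, 3.0147],
    }
)

# A mask is a boolean Series: one True/False per row.
is_new = housing["housing_median_age"] < 10

# Combine masks; note the parentheses around each comparison.
new_and_high_income = housing[is_new & (housing["median_income"] > 5)]
new_or_mid_income = housing[is_new | (housing["median_income"] > 3)]
not_new = housing[~is_new]

# The same "and" filter written with DataFrame.query.
same_result = housing.query("housing_median_age < 10 and median_income > 5")

print(len(new_and_high_income), len(new_or_mid_income), len(not_new))
```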


### Applying Functions
If you want to alter your data or create new columns, you can apply a function to a row or a column.

In [None]:
# persons per room = population divided by number of rooms
california_housing["total_persons_per_room"] = (
    california_housing["population"] / california_housing["total_rooms"]
)
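Column arithmetic like this works on whole columns at once. To apply a function to each row instead, `DataFrame.apply` takes `axis=1` and passes every row to the function as a Series. The sketch below is an illustration on a small hand-built dataframe; `rooms_per_household` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Small stand-in for california_housing (assumption: same column names).
housing = pd.DataFrame(
    {
        "total_rooms": [5612.0, 7650.0, 720.0],
        "households": [472.0, 463.0, 117.0],
    }
)


def rooms_per_household(row):
    """Hypothetical helper: receives one row as a Series."""
    return row["total_rooms"] / row["households"]


# axis=1 means "call the function once per row".
housing["rooms_per_household"] = housing.apply(rooms_per_household, axis=1)

# The equivalent column arithmetic, usually much faster on large data.
vectorized = housing["total_rooms"] / housing["households"]

print(housing)
```

Row-wise `apply` is convenient when the logic needs several columns and is hard to express as arithmetic; for simple formulas, the column form is typically the better choice.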

In [None]:
def rating_function(median_house_value):
    """Map a median house value to a coarse rating label."""
    if median_house_value <= 100000:
        return "poor"
    elif median_house_value <= 500000:
        return "middle class"
    else:
        return "rich"

In [None]:
california_housing["rating"] = california_housing["median_house_value"].apply(
    rating_function
)
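`Series.apply` calls the function once per value. For this kind of binning, one possible alternative is `pd.cut`, which assigns labels from bin edges in a single call. The sketch below is not the notebook's method; it uses a few hand-picked values and checks that the bins reproduce `rating_function`. With the default `right=True`, each bin includes its right edge, which matches the `<=` comparisons.

```python
import numpy as np
import pandas as pd


def rating_function(median_house_value):
    """Same thresholds as the function defined above."""
    if median_house_value <= 100000:
        return "poor"
    elif median_house_value <= 500000:
        return "middle class"
    else:
        return "rich"


# Hand-picked values, including both thresholds, for illustration only.
values = pd.Series([66900.0, 100000.0, 180400.0, 500000.0, 500001.0])

# Bins are (-inf, 100000], (100000, 500000], (500000, inf).
binned = pd.cut(
    values,
    bins=[-np.inf, 100000, 500000, np.inf],
    labels=["poor", "middle class", "rich"],
)

applied = values.apply(rating_function)

print(list(binned))
print(list(applied))
```

The result of `pd.cut` is a categorical column, so it also records the order of the labels, which can be useful when sorting or plotting.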