# Pandas

Pandas (the name derives from "panel data") provides the DataFrame, a powerful structure for manipulating large datasets. Unlike a NumPy array, a DataFrame can hold columns of more than one data type. Pandas can also read CSV files directly.
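
As a minimal sketch of the mixed-type point above (the values and the names `mixed_array` and `mixed_frame` are made up for illustration), a NumPy array built from mixed values falls back to one common type, while a DataFrame keeps a separate type for each column:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so these mixed values are coerced to strings
mixed_array = np.array([809, "United States", True])
print(mixed_array.dtype.kind)  # 'U' stands for a unicode string dtype

# A DataFrame stores each column with its own dtype
mixed_frame = pd.DataFrame(
    {"cars_per_cap": [809], "country": ["United States"], "drives_right": [True]}
)
print(mixed_frame.dtypes)
```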

In [16]:
import pandas as pd

cars = pd.read_csv(r"C:\Users\lrspe\Desktop\MS Data Science\5. Python for Data Science\cars.csv", index_col = 0)

print(cars)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


To select columns, use square brackets: single brackets return a Pandas Series, double brackets return a Pandas DataFrame.

In [18]:
print(cars['cars_per_cap'])

US     809
AUS    731
JAP    588
IN      18
RU     200
MOR     70
EG      45
Name: cars_per_cap, dtype: int64
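
The difference between single and double brackets can be checked with `type()`. The sketch below uses a small hand-typed stand-in called `cars_demo` (an illustrative name, with only three of the rows shown above) so that it runs without the CSV file:

```python
import pandas as pd

# Hand-typed stand-in for the first three rows of the cars data
cars_demo = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588],
        "country": ["United States", "Australia", "Japan"],
        "drives_right": [True, False, False],
    },
    index=["US", "AUS", "JAP"],
)

as_series = cars_demo["cars_per_cap"]   # single brackets give a Series
as_frame = cars_demo[["cars_per_cap"]]  # double brackets give a DataFrame

print(type(as_series))
print(type(as_frame))
```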


Rows are selected by label with loc. A single label returns that row as a Series (displayed as a column), while a list of labels returns a sub-DataFrame. Adding column labels after the comma narrows the selection further.

In [20]:
print(cars.loc['JAP'])

cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object


In [21]:
print(cars.loc[['US','AUS']])

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False


In [23]:
print(cars.loc[['MOR'], 'drives_right'])

print(cars.loc[['MOR', 'RU'], ['country', 'drives_right']])

MOR    True
Name: drives_right, dtype: bool
     country  drives_right
MOR  Morocco          True
RU    Russia          True
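
The shape of the result depends on whether `loc` receives a single label or a list of labels. A minimal sketch on a hand-typed stand-in (`cars_demo` and the variable names are illustrative, not part of the original notebook):

```python
import pandas as pd

# Hand-typed stand-in for three rows of the cars data
cars_demo = pd.DataFrame(
    {
        "country": ["United States", "Australia", "Japan"],
        "drives_right": [True, False, False],
    },
    index=["US", "AUS", "JAP"],
)

row_series = cars_demo.loc["JAP"]       # single label: the row as a Series
row_frame = cars_demo.loc[["JAP"]]      # list of labels: a one-row DataFrame
cell = cars_demo.loc["JAP", "country"]  # row label and column label: one value

print(type(row_series).__name__, type(row_frame).__name__, cell)
```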


Another useful tool is adding custom columns to the dataset, either from a list or computed from existing columns.

In [39]:
# Add a new column "gdp" from a list (one value per row, in row order)
cars["gdp"] = ['High', 'High', 'High', 'Medium', 'Medium', 'Low', 'Low']
print(cars)

     cars_per_cap        country  drives_right     gdp
US            809  United States          True    High
AUS           731      Australia         False    High
JAP           588          Japan         False    High
IN             18          India         False  Medium
RU            200         Russia          True  Medium
MOR            70        Morocco          True     Low
EG             45          Egypt          True     Low


In [42]:
# Here a custom column "cars/100" is added dividing "cars_per_cap" by 100
cars["cars/100"] = cars["cars_per_cap"] / 100
print(cars)

     cars_per_cap        country  drives_right     gdp  cars/100
US            809  United States          True    High      8.09
AUS           731      Australia         False    High      7.31
JAP           588          Japan         False    High      5.88
IN             18          India         False  Medium      0.18
RU            200         Russia          True  Medium      2.00
MOR            70        Morocco          True     Low      0.70
EG             45          Egypt          True     Low      0.45
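
One detail worth knowing when adding a column from a list is that the list needs one value per row. The sketch below, on a hand-typed stand-in (`cars_demo`, `too_short` and `length_error` are illustrative names), shows both kinds of new column and the error raised for a list of the wrong length:

```python
import pandas as pd

# Hand-typed stand-in for three rows of the cars data
cars_demo = pd.DataFrame({"cars_per_cap": [809, 731, 588]}, index=["US", "AUS", "JAP"])

# From a list: one value per row, in row order
cars_demo["gdp"] = ["High", "High", "High"]

# From an existing column: the division is applied element-wise
cars_demo["cars/100"] = cars_demo["cars_per_cap"] / 100

# A list of the wrong length is rejected with a ValueError
length_error = None
try:
    cars_demo["too_short"] = ["High", "Low"]
except ValueError as err:
    length_error = str(err)
    print("ValueError:", length_error)

print(cars_demo)
```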


To quickly understand the structure of the data in a Pandas DataFrame df, use df.shape for the number of rows and columns, df.describe() for summary statistics and df.dtypes for the data types.
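
A minimal sketch of these three calls on a hand-typed stand-in (`cars_demo` and `summary` are illustrative names). Note that `shape` is an attribute giving (rows, columns), while `describe()` is a method that by default summarises only the numeric columns:

```python
import pandas as pd

# Hand-typed stand-in for three rows of the cars data
cars_demo = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588],
        "country": ["United States", "Australia", "Japan"],
        "drives_right": [True, False, False],
    },
    index=["US", "AUS", "JAP"],
)

print(cars_demo.shape)   # a tuple of (rows, columns)
print(cars_demo.dtypes)  # one data type per column

summary = cars_demo.describe()  # count, mean, std, min, quartiles, max
print(summary)
```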

## Advanced Datasets

In [72]:
import pandas as pd
#Import csv file
recent_grads = pd.read_csv(r"C:\Users\L.Spencer\Dropbox\Data Science Shared\Data Files\recent-grads.csv")

# Print data types
print(recent_grads.dtypes)

Rank                      int64
Major_code                int64
Major                    object
Total                   float64
Men                     float64
Women                   float64
Major_category           object
ShareWomen              float64
Sample_size               int64
Employed                  int64
Full_time                 int64
Part_time                 int64
Full_time_year_round      int64
Unemployed                int64
Unemployment_rate       float64
Median                    int64
P25th                     int64
P75th                     int64
College_jobs              int64
Non_college_jobs          int64
Low_wage_jobs             int64
dtype: object


In [84]:
#View the shape of the data
print(recent_grads.shape)

(173, 21)


In [73]:
#View the data in the DataFrame
recent_grads

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,756.0,679.0,77.0,Engineering,0.101852,7,640,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,856.0,725.0,131.0,Engineering,0.153037,3,648,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258.0,1123.0,135.0,Engineering,0.107313,16,758,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,32260.0,21239.0,11021.0,Engineering,0.341631,289,25694,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972
5,6,2418,NUCLEAR ENGINEERING,2573.0,2200.0,373.0,Engineering,0.144967,17,1857,...,264,1449,400,0.177226,65000,50000,102000,1142,657,244
6,7,6202,ACTUARIAL SCIENCE,3777.0,2110.0,1667.0,Business,0.441356,51,2912,...,296,2482,308,0.095652,62000,53000,72000,1768,314,259
7,8,5001,ASTRONOMY AND ASTROPHYSICS,1792.0,832.0,960.0,Physical Sciences,0.535714,10,1526,...,553,827,33,0.021167,62000,31500,109000,972,500,220
8,9,2414,MECHANICAL ENGINEERING,91227.0,80320.0,10907.0,Engineering,0.119559,1029,76442,...,13101,54639,4650,0.057342,60000,48000,70000,52844,16384,3253
9,10,2408,ELECTRICAL ENGINEERING,81527.0,65511.0,16016.0,Engineering,0.196450,631,61928,...,12695,41413,3895,0.059174,60000,45000,72000,45829,10874,3170


In [71]:
# Output summary statistics excluding object types
recent_grads.describe(exclude=['object'])

Unnamed: 0,Rank,Major_code,Total,Men,Women,ShareWomen,Sample_size,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
count,173.0,173.0,172.0,172.0,172.0,172.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
mean,87.0,3879.815029,39370.081395,16723.406977,22646.674419,0.522223,356.080925,31192.763006,26029.306358,8832.398844,19694.427746,2416.32948,0.068191,40151.445087,29501.445087,51494.219653,12322.635838,13284.49711,3859.017341
std,50.084928,1687.75314,63483.491009,28122.433474,41057.33074,0.231205,618.361022,50675.002241,42869.655092,14648.179473,33160.941514,4112.803148,0.030331,11470.181802,9166.005235,14906.27974,21299.868863,23789.655363,6944.998579
min,1.0,1100.0,124.0,119.0,0.0,0.0,2.0,0.0,111.0,0.0,111.0,0.0,0.0,22000.0,18500.0,22000.0,0.0,0.0,0.0
25%,44.0,2403.0,4549.75,2177.5,1778.25,0.336026,39.0,3608.0,3154.0,1030.0,2453.0,304.0,0.050306,33000.0,24000.0,42000.0,1675.0,1591.0,340.0
50%,87.0,3608.0,15104.0,5434.0,8386.5,0.534024,130.0,11797.0,10048.0,3299.0,7413.0,893.0,0.067961,36000.0,27000.0,47000.0,4390.0,4595.0,1231.0
75%,130.0,5503.0,38909.75,14631.0,22553.75,0.703299,338.0,31433.0,25147.0,9948.0,16891.0,2393.0,0.087557,45000.0,33000.0,60000.0,14444.0,11783.0,3466.0
max,173.0,6403.0,393735.0,173809.0,307087.0,0.968954,4212.0,307933.0,251540.0,115172.0,199897.0,28169.0,0.177226,110000.0,95000.0,125000.0,151643.0,148395.0,48207.0


In [83]:
# select a single column and select the first five rows
sw_col = recent_grads['ShareWomen']
print(sw_col.head())

0    0.120564
1    0.101852
2    0.153037
3    0.107313
4    0.341631
Name: ShareWomen, dtype: float64


In [90]:
# find the max share of women in a major with the Series max() method
print(recent_grads['ShareWomen'].max())

# use loc & idxmax to show the full row for the major with the highest ShareWomen
recent_grads.loc[recent_grads[['ShareWomen']].idxmax()]

0.968953683


Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
164,165,2307,EARLY CHILDHOOD EDUCATION,37589.0,1167.0,36422.0,Education,0.968954,342,32551,...,7001,20748,1360,0.040105,28000,21000,35000,23515,7705,2868
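
The `idxmax` pattern can be broken into two steps. The sketch below uses made-up values and placeholder major names (`grads_demo`, `top_label` and the data are illustrative only): `idxmax()` returns the index label of the largest value, and `loc` then fetches the row with that label.

```python
import pandas as pd

# Made-up data with placeholder major names, for illustration only
grads_demo = pd.DataFrame(
    {"Major": ["A", "B", "C"], "ShareWomen": [0.12, 0.97, 0.53]}
)

top_label = grads_demo["ShareWomen"].idxmax()  # index label of the largest value
top_row = grads_demo.loc[top_label]            # that row as a Series
top_frame = grads_demo.loc[[top_label]]        # the same row as a one-row DataFrame

print(top_label)
print(top_frame)
```

Calling `idxmax()` on a one-column DataFrame, as in the cell above, returns a Series of labels, which is why `loc` gives back a DataFrame there.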


In [95]:
# convert a Pandas sub-DataFrame to a NumPy array
import numpy as np
recent_grads_np = np.array(recent_grads[['Unemployed', 'Low_wage_jobs']])
print(type(recent_grads_np))

<class 'numpy.ndarray'>


In [97]:
# find out if there is a correlation between unemployment and low wage jobs (the coefficient ranges from -1 to 1)
print(np.corrcoef(recent_grads_np[:,0],recent_grads_np[:,1]))

# the off-diagonal value shows a strong positive correlation of about 0.96

[[1.         0.95538815]
 [0.95538815 1.        ]]
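
The same coefficient can also be computed without leaving Pandas. The sketch below compares the two routes on made-up numbers (`demo`, `x` and `y` are illustrative names): `np.corrcoef` returns a 2x2 matrix whose off-diagonal entry is the coefficient, while `Series.corr` returns the Pearson coefficient directly.

```python
import numpy as np
import pandas as pd

# Made-up numbers, for illustration only
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# NumPy route: convert to an array, then read the off-diagonal entry
demo_np = demo[["x", "y"]].to_numpy()
r_numpy = np.corrcoef(demo_np[:, 0], demo_np[:, 1])[0, 1]

# Pandas route: Series.corr uses the Pearson method by default
r_pandas = demo["x"].corr(demo["y"])

print(r_numpy, r_pandas)
```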


## Missing Values

One common task is to find and replace missing values. Pandas marks a missing numeric value as NaN ('not a number').

In [70]:
# count the missing values in each column
recent_grads.isnull().sum()

Rank                    0
Major_code              0
Major                   0
Total                   1
Men                     1
Women                   1
Major_category          0
ShareWomen              1
Sample_size             0
Employed                0
Full_time               0
Part_time               0
Full_time_year_round    0
Unemployed              0
Unemployment_rate       0
Median                  0
P25th                   0
P75th                   0
College_jobs            0
Non_college_jobs        0
Low_wage_jobs           0
dtype: int64

In [74]:
# view the full rows where ShareWomen is missing
recent_grads[recent_grads.ShareWomen.isnull()]

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
21,22,1104,FOOD SCIENCE,,,,Agriculture & Natural Resources,,36,3149,...,1121,1735,338,0.096931,53000,32000,70000,1183,1274,485
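
Once the missing values are located, they can be dropped or replaced. The sketch below uses made-up data (`demo`, the placeholder major names and the choice of the median as the fill value are assumptions for illustration, not the notebook's method):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, for illustration only
demo = pd.DataFrame(
    {"Major": ["A", "B", "C"], "Total": [100.0, np.nan, 300.0]}
)

missing_per_column = demo.isnull().sum()  # count of NaN values in each column
dropped = demo.dropna()                   # remove rows that contain any NaN

# Replace NaN in "Total" with the column median (median() skips NaN by default)
filled = demo.fillna({"Total": demo["Total"].median()})

print(missing_per_column)
print(filled)
```

Both `dropna()` and `fillna()` return new DataFrames here, so `demo` itself still contains the missing value.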
