### CSV to DataFrame


Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use read_csv().

In [4]:
#import pandas
import pandas as pd

#import the cars csv
cars = pd.read_csv('cars.csv')

#print out cars
print(cars)

  Unnamed: 0  cars_per_cap        country  drives_right
0         US           809  United States          True
1        AUS           731      Australia         False
2        JAP           588          Japan         False
3         IN            18          India         False
4         RU           200         Russia          True
5        MOR            70        Morocco          True
6         EG            45          Egypt          True


In [7]:
# Import pandas as pd
import pandas as pd

# Fix import by using the first column (the country codes) as the row labels
cars = pd.read_csv('cars.csv', index_col=0)

# Print out cars
print(cars)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
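
For readers without cars.csv at hand, the effect of index_col can be reproduced with an in-memory file. This is a minimal sketch: the csv_text string below is a hypothetical stand-in for the real file, assuming its first header field is empty (which would explain the "Unnamed: 0" column name).

```python
import io

import pandas as pd

# Hypothetical stand-in for cars.csv. The first header field is empty,
# so pandas names that column "Unnamed: 0" on a default import.
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True
"""

# Default import: the country codes end up in an ordinary column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 promotes the first column to the row labels instead.
cars = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(cars.index.tolist())
```

With index_col=0 there is no leftover column to delete afterwards, because the country codes become the index directly.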


### Square Brackets to extract columns

A Pandas Series is a one-dimensional labeled array, while a single-column DataFrame is a two-dimensional table that happens to have only one column. Both always carry an index: if you don't supply one, Pandas creates a default integer index. The practical difference is the type you get back, and therefore which methods and display format apply.

    Pandas Series                  Single-column DataFrame
    1D labeled array               2D table
    No column axis                 One column
    Has an index and a name        Has an index and a column label
    Usually slightly less overhead Usually slightly more overhead

A Series is like a single column; a DataFrame is the whole table.
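
A quick way to see the difference is to inspect the type and shape of each result. The sketch below rebuilds a three-row subset of the cars data by hand so that it runs on its own; the variable names as_series and as_frame are illustrative only.

```python
import pandas as pd

# Hand-built subset of the cars data, so the example is self-contained.
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588],
        "country": ["United States", "Australia", "Japan"],
    },
    index=["US", "AUS", "JAP"],
)

as_series = cars["country"]    # single brackets
as_frame = cars[["country"]]   # double brackets

print(type(as_series).__name__, as_series.ndim, as_series.shape)
print(type(as_frame).__name__, as_frame.ndim, as_frame.shape)

# Both objects carry the same row labels.
print(as_series.index.equals(as_frame.index))
```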
    

In [8]:
# Print out country column as Pandas Series
print(cars['country'])

# Print out country column as Pandas DataFrame
print(cars[['country']])

# Print out DataFrame with country and drives_right columns
print(cars[['country','drives_right']])

US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
           country
US   United States
AUS      Australia
JAP          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt
           country  drives_right
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True


In [9]:
# Print out first 3 observations
print(cars[0:3])

# Print out fourth, fifth and sixth observation
print(cars[3:6])

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
     cars_per_cap  country  drives_right
IN             18    India         False
RU            200   Russia          True
MOR            70  Morocco          True
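
Square brackets behave differently depending on what goes inside them: a slice selects rows by position, while a single value is looked up as a column label. The following sketch, using a hand-built copy of the cars data, illustrates both cases; the flag name raised is only there for the demonstration.

```python
import pandas as pd

# Hand-built copy of the cars data, so the example is self-contained.
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

# A slice inside square brackets selects rows by position, like iloc.
first_three = cars[0:3]
print(first_three.equals(cars.iloc[0:3]))

# A bare integer is treated as a column label, so this lookup fails.
raised = False
try:
    cars[0]
except KeyError:
    raised = True
print("cars[0] raised KeyError:", raised)
```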


### loc and iloc

With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you specify rows and columns by their row and column labels. iloc is integer-position based, so you specify rows and columns by their integer position, as you did with the slices in the previous exercise.

Try out the following commands in the IPython Shell to experiment with loc and iloc to select observations. Each pair of commands here gives the same result.

    cars.loc['RU']
    cars.iloc[4]
    
    cars.loc[['RU']]
    cars.iloc[[4]]
    
    cars.loc[['RU', 'AUS']]
    cars.iloc[[4, 1]]
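
One way to check that each pair really matches is to compare the results with equals(). This is a sketch against a hand-built copy of the cars data; the list name pairs_match is illustrative.

```python
import pandas as pd

# Hand-built copy of the cars data, so the example is self-contained.
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

# Compare the label-based and position-based result of each pair.
pairs_match = [
    cars.loc["RU"].equals(cars.iloc[4]),                   # one row as Series
    cars.loc[["RU"]].equals(cars.iloc[[4]]),               # one row as DataFrame
    cars.loc[["RU", "AUS"]].equals(cars.iloc[[4, 1]]),     # two rows, same order
]
print(pairs_match)
```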

In [10]:
cars.loc['RU']
cars.iloc[4]

cars_per_cap       200
country         Russia
drives_right      True
Name: RU, dtype: object

In [11]:
cars.loc[['RU']]
cars.iloc[[4]]

    cars_per_cap country  drives_right
RU           200  Russia          True


In [12]:
cars.loc[['RU', 'AUS']]
cars.iloc[[4, 1]]

     cars_per_cap    country  drives_right
RU            200     Russia          True
AUS           731  Australia         False


In [22]:
# Use loc to select the observation for Japan (row label JAP) as a DataFrame
print(cars.loc[['JAP']])
# Use loc to print out the observations for Egypt and India
print(cars.loc[['EG','IN']])

     cars_per_cap country  drives_right
JAP           588   Japan         False
    cars_per_cap country  drives_right
EG            45   Egypt          True
IN            18   India         False


In [23]:
# Use iloc to select the observation for Japan (integer position 2) as a DataFrame
print(cars.iloc[[2]])
# Use iloc to print out the observations for India and Egypt (positions 3 and 6)
print(cars.iloc[[3,6]])

     cars_per_cap country  drives_right
JAP           588   Japan         False
    cars_per_cap country  drives_right
IN            18   India         False
EG            45   Egypt          True


In [25]:
# Print out drives_right value of Morocco as a one-cell DataFrame
print(cars.loc[['MOR'], ['drives_right']])

# Print out a sub-DataFrame, containing the observations for Russia and Morocco and the columns country and drives_right
print(cars.loc[['RU', 'MOR'], 
                ['country', 'drives_right']])


     drives_right
MOR          True
     country  drives_right
RU    Russia          True
MOR  Morocco          True


In [27]:
# Print out drives_right column as Series
print(cars.loc[:,'drives_right'])

# Print out drives_right column as DataFrame
print(cars.loc[:,['drives_right']])

# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ['cars_per_cap', 'drives_right']])

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool
     drives_right
US           True
AUS         False
JAP         False
IN          False
RU           True
MOR          True
EG           True
     cars_per_cap  drives_right
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True


In [29]:
# Print out drives_right column as Series
print(cars.iloc[:, 2])

# Print out drives_right column as DataFrame
print(cars.iloc[:, [2]])

# Print out cars_per_cap and drives_right as DataFrame
print(cars.iloc[:, [0, 2]])

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool
     drives_right
US           True
AUS         False
JAP         False
IN          False
RU           True
MOR          True
EG           True
     cars_per_cap  drives_right
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True
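
When the integer positions are not known in advance, they can be looked up from the labels with Index.get_loc. The sketch below uses a hand-built copy of the cars data; row_pos, col_pos and same are illustrative names.

```python
import pandas as pd

# Hand-built copy of the cars data, so the example is self-contained.
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

# Translate labels into the integer positions that iloc expects.
row_pos = cars.index.get_loc("MOR")
col_pos = cars.columns.get_loc("drives_right")
print(row_pos, col_pos)

# The same cell, reached once by position and once by label.
same = bool(cars.iloc[row_pos, col_pos] == cars.loc["MOR", "drives_right"])
print(same)
```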


### Driving right

In [30]:
# Extract drives_right column as Series: dr
dr = cars['drives_right']

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)

     cars_per_cap        country  drives_right
US            809  United States          True
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


In [31]:
# Convert code to a one-liner
sel = cars[cars['drives_right']]

# Print sel
print(sel)

     cars_per_cap        country  drives_right
US            809  United States          True
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


### Cars per capita

Similar to the previous example, you'll want to build up a boolean Series that you can then use to subset the cars DataFrame to select certain observations. If you want to do this in a one-liner, that's perfectly fine!

In [32]:
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap'] 
many_cars = cpc > 500
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
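
The same selection can be written as a one-liner by placing the comparison directly inside the square brackets. A minimal sketch, using a hand-built copy of the cars data so that it runs on its own:

```python
import pandas as pd

# Hand-built copy of the cars data, so the example is self-contained.
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

# The comparison yields a boolean Series, one value per row.
mask = cars["cars_per_cap"] > 500
print(mask.tolist())

# One-liner: build the mask and apply it in a single expression.
car_maniac = cars[cars["cars_per_cap"] > 500]
print(car_maniac.index.tolist())
```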


In [34]:
#import numpy
import numpy as np

# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]

# Print medium
print(medium)

    cars_per_cap country  drives_right
RU           200  Russia          True
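
As an alternative to np.logical_and, pandas also accepts the & operator between boolean Series, provided each comparison is wrapped in parentheses. The sketch below compares both spellings on a hand-built copy of the cars data; with_numpy and with_operator are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the cars data, so the example is self-contained.
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

cpc = cars["cars_per_cap"]

# NumPy spelling, as in the cell above.
with_numpy = cars[np.logical_and(cpc > 100, cpc < 500)]

# Operator spelling: the parentheses are required because & binds
# more tightly than the comparison operators.
with_operator = cars[(cpc > 100) & (cpc < 500)]

print(with_operator.equals(with_numpy))
```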
