A few pandas questions, with answers and examples.

###### When reading from a file, how do we pull only a subset of the columns or rows?

In [2]:
import pandas as pd

In [3]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [4]:
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

The DataFrame above has 5 columns, but I only need 2 of them: 'City' and 'State'. Instead of reading every column and filtering afterwards, I can read just the required columns from the source.

To do that, pass the 'usecols' parameter to the read_csv function:

In [5]:
ufo = pd.read_csv('http://bit.ly/uforeports', usecols=['City', 'State'])  # Pass the required columns (a Python list) to the 'usecols' parameter.

In [6]:
ufo.columns

Index(['City', 'State'], dtype='object')

This reads only the required columns. We can also reference columns by their positions:

In [7]:
ufo = pd.read_csv('http://bit.ly/uforeports', usecols=[0, 3])  # usecols: return a subset of the columns.

In [8]:
ufo.columns

Index(['City', 'State'], dtype='object')

So that is how we pull only the required columns, by column label or by column position:

ufo = pd.read_csv('http://bit.ly/uforeports', usecols=[0, 3])  # by column position

ufo = pd.read_csv('http://bit.ly/uforeports', usecols=['City', 'State'])  # by column label
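
To try both forms without a network connection, here is a minimal sketch. The inline CSV text is a hypothetical stand-in for the UFO file: it copies the column layout and the three rows shown in this notebook, and the variable names are mine.

```python
import io

import pandas as pd

# Hypothetical inline sample with the same column layout as the UFO file.
csv_text = """City,Colors Reported,Shape Reported,State,Time
Ithaca,,TRIANGLE,NY,6/1/1930 22:00
Willingboro,,OTHER,NJ,6/30/1930 20:00
Holyoke,,OVAL,CO,2/15/1931 14:00
"""

# Select the columns by label.
by_label = pd.read_csv(io.StringIO(csv_text), usecols=['City', 'State'])

# Select the same columns by zero-based position.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])

print(by_label.columns.tolist())
print(by_label.equals(by_position))
```

Note that read_csv keeps the selected columns in file order, whatever order they are listed in usecols.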

###### A quick way to read only part of a CSV file:

In [9]:
ufo = pd.read_csv('http://bit.ly/uforeports', nrows=3)  # Read only the first 3 rows.
# nrows: number of rows of the file to read. Useful for reading pieces of large files.

In [10]:
ufo

          City  Colors Reported  Shape Reported  State             Time
0       Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1  Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2      Holyoke              NaN            OVAL     CO  2/15/1931 14:00
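
The two parameters can be combined, so that only a few rows of a few columns are parsed. This is a rough sketch on a hypothetical inline sample. The first three rows are the ones shown above, the fourth row is invented for illustration, and the variable names are mine.

```python
import io

import pandas as pd

# Hypothetical inline sample; the last row is made up for this example.
csv_text = """City,Colors Reported,Shape Reported,State,Time
Ithaca,,TRIANGLE,NY,6/1/1930 22:00
Willingboro,,OTHER,NJ,6/30/1930 20:00
Holyoke,,OVAL,CO,2/15/1931 14:00
Exampleville,,DISK,KS,1/1/1932 12:00
"""

# Read only the first 2 data rows of the 2 required columns.
preview = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['City', 'State'],
    nrows=2,
)

print(preview.shape)
print(preview)
```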


###### How do DataFrame and Series work with regard to selecting individual entries and iteration?

A pandas Series is iterable, just like a Python list:

In [13]:
for c in ufo.City:
    print(c)

Ithaca
Willingboro
Holyoke


For iterating through a DataFrame, pandas has specific methods. iterrows() works somewhat like enumerate:

In [14]:
for index, row in ufo.iterrows(): # DataFrame.iterrows() : Iterate over DataFrame rows as (index, Series) pairs.
    print(index, row.City, row.State)

0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO


As shown above, when I iterate through the rows of a DataFrame with the iterrows() method, each step yields the index and the row as a Series, from which I pull two values (City and State).
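
As a side-by-side sketch, the same loop can also be written with DataFrame.itertuples(), which yields one namedtuple per row and which the pandas documentation describes as generally faster than iterrows(). itertuples() is not used in the original notebook; I add it here only as an alternative. The small DataFrame below is a hypothetical stand-in built from the three rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the first three rows of the UFO data.
ufo_sample = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', 'Holyoke'],
    'State': ['NY', 'NJ', 'CO'],
})

# iterrows() yields (index, Series) pairs.
from_iterrows = [
    (index, row['City'], row['State'])
    for index, row in ufo_sample.iterrows()
]

# itertuples() yields namedtuples; the index is exposed as .Index.
from_itertuples = [
    (row.Index, row.City, row.State)
    for row in ufo_sample.itertuples()
]

print(from_iterrows)
print(from_itertuples)
```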

###### Best way to drop every non-numeric column from a DataFrame:

In [15]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

In [16]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

The DataFrame above has 4 numeric columns (3 int and 1 float) and 2 non-numeric columns. So how do I keep just the 4 numeric columns?

In [17]:
import numpy as np

In [20]:
drinks.select_dtypes(include=[np.number]).dtypes
#DataFrame.select_dtypes(include=None, exclude=None) : Return a subset of the DataFrame’s columns based on the column dtypes.

beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object
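
Here is a small offline sketch of the same idea. The DataFrame is a hypothetical stand-in with invented values, not the real drinks data. It also shows the string alias 'number' and the complementary 'exclude' parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the drinks data; all values are invented.
drinks_sample = pd.DataFrame({
    'country': ['Aland', 'Borduria'],
    'beer_servings': [10, 89],
    'total_litres_of_pure_alcohol': [0.5, 4.9],
    'continent': ['Europe', 'Europe'],
})

# Keep only the numeric columns, as in the cell above.
numeric_only = drinks_sample.select_dtypes(include=[np.number])

# The string alias 'number' selects the same columns.
numeric_by_alias = drinks_sample.select_dtypes(include='number')

# 'exclude' gives the complement: everything that is not numeric.
non_numeric = drinks_sample.select_dtypes(exclude=[np.number])

print(numeric_only.columns.tolist())
print(non_numeric.columns.tolist())
```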

###### How to know whether we should pass an argument as a string or a list:
e.g., why write movie.describe(include=['object']) with square brackets rather than movie.describe(include='object')?

In [21]:
drinks.describe()

       beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
count          193.0            193.0          193.0                         193.0
mean      106.160622        80.994819      49.450777                      4.717098
std       101.143103        88.284312      79.697598                      3.773298
min              0.0              0.0            0.0                           0.0
25%             20.0              4.0            1.0                           1.3
50%             76.0             56.0            8.0                           4.2
75%            188.0            128.0           59.0                           7.2
max            376.0            438.0          370.0                          14.4


By default, describe() summarizes all the numeric columns in a DataFrame. Its 'include' parameter takes a list-like of dtypes, 'all', or None (the default, in which case the result includes all numeric columns). So if you want every column of the DataFrame described, not just the numeric ones, use describe(include='all') as below:

In [22]:
drinks.describe(include='all')

        country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
count       193          193.0            193.0          193.0                         193.0        193
unique      193            NaN              NaN            NaN                           NaN          6
top     Iceland            NaN              NaN            NaN                           NaN     Africa
freq          1            NaN              NaN            NaN                           NaN         53
mean        NaN     106.160622        80.994819      49.450777                      4.717098        NaN
std         NaN     101.143103        88.284312      79.697598                      3.773298        NaN
min         NaN            0.0              0.0            0.0                           0.0        NaN
25%         NaN           20.0              4.0            1.0                           1.3        NaN
50%         NaN           76.0             56.0            8.0                           4.2        NaN
75%         NaN          188.0            128.0           59.0                           7.2        NaN
max         NaN          376.0            438.0          370.0                          14.4        NaN


In [23]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

Next, you can pass a list of dtypes to the DataFrame's describe() as below:

In [25]:
drinks.describe(include=['object','float64'])

        country  total_litres_of_pure_alcohol  continent
count       193                         193.0        193
unique      193                           NaN          6
top     Iceland                           NaN     Africa
freq          1                           NaN         53
mean        NaN                      4.717098        NaN
std         NaN                      3.773298        NaN
min         NaN                           0.0        NaN
25%         NaN                           1.3        NaN
50%         NaN                           4.2        NaN
75%         NaN                           7.2        NaN
max         NaN                          14.4        NaN


As shown above, describe() covers only the dtypes listed in the 'include' parameter.

So pandas documents 'include' as a list-like of dtypes because that gives you the option of specifying multiple types. If you want to specify just one type, the documented form is still a list; it is simply a list of length one.
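
To compare the forms of 'include' side by side, here is a minimal sketch on a hypothetical three-row DataFrame with invented values. The text column is given an explicit object dtype, an assumption of mine so that the 'object' selector matches it whatever default string dtype the installed pandas version uses.

```python
import pandas as pd

# Hypothetical stand-in for the drinks data; all values are invented.
drinks_sample = pd.DataFrame({
    # Explicit object dtype so that include=['object'] matches this column.
    'country': pd.Series(['Aland', 'Borduria', 'Calisota'], dtype='object'),
    'beer_servings': [10, 89, 130],
    'total_litres_of_pure_alcohol': [0.5, 4.9, 7.6],
})

# Default: numeric columns only.
default_cols = drinks_sample.describe().columns.tolist()

# A list of length one: a single dtype.
object_cols = drinks_sample.describe(include=['object']).columns.tolist()

# A longer list: several dtypes at once.
mixed_cols = drinks_sample.describe(include=['object', 'float64']).columns.tolist()

# The special string 'all': every column.
all_cols = drinks_sample.describe(include='all').columns.tolist()

print(default_cols)
print(object_cols)
print(mixed_cols)
print(all_cols)
```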