# Pandas for Data Analysis

* Pandas is an open source library built on top of NumPy.
* It allows fast analysis, data cleaning, and preparation activities.
* It also has built-in visualization features.

**To install pandas, we must execute one of the two commands in the terminal.**
* **pip install pandas**
* **conda install pandas**

During this session we will be covering the following concepts. 
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenation
* Operations
* Data Input and Output

## Series
A Series is very similar to a NumPy array and is built on top of the NumPy array object.
What differentiates a Series from a NumPy array is that a Series has axis labels, so it can be indexed by label rather than only by numeric position.

In [25]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [26]:
# Creating a list, array, and a dictionary
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [27]:
# We will now see how to create a pandas Series from each of these data types.
# In a Jupyter notebook, SHIFT + TAB shows which arguments a function takes
# Note that Series is spelled with an upper case 'S'
pd.Series(data = my_list)

0    10
1    20
2    30
dtype: int64

Here we can see that the index is 0,1,2 and the actual data is 10,20,30.

In [28]:
# Now specifying the index, we get 
pd.Series(data = my_list, index = labels)

a    10
b    20
c    30
dtype: int64

Here, the index is the contents of the labels list and the data is 10,20,30.

In [29]:
# We can also pass the data and the index as positional arguments
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

In [30]:
# Similarly we can pass in arrays as well 
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

In [31]:
# Using the dictionaries for the series, we get:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

Note that the keys of the dictionary become the index and 
the values of the dictionary become the values of the series.

#### A series can also hold a variety of object types.

In [32]:
pd.Series(data = labels)

0    a
1    b
2    c
dtype: object

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers for fast lookups of information, much like a hash table or dictionary.

Let's see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2:

In [33]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','Russia', 'Japan'])   
ser1

USA        1
Germany    2
Russia     3
Japan      4
dtype: int64

In [34]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan']) 
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [35]:
# Grabbing information from a Series by label works much like looking up a key in a Python dictionary.
ser1['USA']

1
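To illustrate the dictionary-like behaviour of the index, here is a minimal sketch that reuses the ser1 values from above. The variable names has_japan, has_italy, subset, and italy_value are only illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Russia', 'Japan'])

# Membership tests check the index labels, much like keys in a dictionary
has_japan = 'Japan' in ser1
has_italy = 'Italy' in ser1

# Passing a list of labels returns a smaller Series
subset = ser1[['USA', 'Japan']]

# .get returns a default value instead of raising KeyError for a missing label
italy_value = ser1.get('Italy', 0)

print(has_japan, has_italy, italy_value)
print(subset)
```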

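We can also perform operations based on these indexes. Values are aligned by label, so a label that appears in only one of the two Series produces NaN, and the result is converted to float.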

In [36]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
Russia     NaN
USA        2.0
dtype: float64
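If the NaN entries are not wanted, one option is the add method with fill_value, which treats a label missing from one Series as 0. This is a small sketch using the same ser1 and ser2 values; the name total is illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Russia', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Labels present in only one Series are treated as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether 0 is a sensible stand-in for a missing label depends on the data, so this is a choice to make deliberately rather than a default.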

In [37]:
ser1 * 2

USA        2
Germany    4
Russia     6
Japan      8
dtype: int64

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.

In [38]:
# Seeding ensures that we generate the same random numbers whenever the number passed within the parentheses is the same
from numpy.random import randn
np.random.seed(101)

In [39]:
# Notice here that the 'D' and 'F' are capitalized.
df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
# Use the SHIFT + TAB feature to check out the docstring of the DataFrame

In [40]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Here, we obtain a table with columns W, X, Y, Z and rows A, B, C, D, E.
Each of these columns is actually a pandas Series.

## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [41]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [42]:
# This is a pandas series. We can go ahead and confirm that with the type function.
type(df['W'])

pandas.core.series.Series

In [43]:
# We can also pass a list of column names.
# Note that since we are using a list, we have an additional set of brackets.
df[['W','X']]

Unnamed: 0,W,X
A,2.70685,0.628133
B,0.651118,-0.319318
C,-2.018168,0.740122
D,0.188695,-0.758872
E,0.190794,1.978757


In [44]:
# Creating new columns in pandas
# Pandas supports the creation of a new column by just specifying the column name as though it already exists
df['new'] = df['W'] + df['Y']

In [45]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [46]:
df.drop('new')

KeyError: "['new'] not found in axis"

This gives an error because, by default, the df.drop method looks for the label in the index of the DataFrame (axis = 0). We must pass axis = 1 if we want to refer to column names.

In [47]:
df.drop('new', axis = 1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


This drop is not happening in place. This means that we are not affecting the state of the original frame unless we explicitly specify it. 

In [48]:
# If we look at the original dataframe then we understand that it has not changed
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [49]:
# To make the change persist, we can use one of two methods.
# We can either assign the result of drop to a variable or pass the argument 'inplace = True'
df.drop('new', axis=1, inplace=True)

In [50]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
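As an alternative to axis = 1, drop also accepts a columns keyword, which some readers find easier to remember. The sketch below uses a small stand-in frame named demo (not the df from this notebook) and shows the reassignment style instead of inplace.

```python
import numpy as np
import pandas as pd

# Stand-in frame for illustration only
demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['A', 'B'], columns=['W', 'X', 'new'])

# Equivalent to demo.drop('new', axis=1); returns a new DataFrame
trimmed = demo.drop(columns='new')

# demo itself is unchanged until we reassign it
still_has_new = 'new' in demo.columns
demo = demo.drop(columns=['new'])

print(trimmed)
```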


In [51]:
# We can even drop rows likewise. 
# Note that we do not have to mention the axis as it is 0 by default
df.drop('E')

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [52]:
# Let's now talk about selecting rows
# There are two ways to select rows: loc (by label) and iloc (by position).
# Note that the row identifier is enclosed within square brackets in pandas.
df.loc['C']

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [53]:
# We use df.iloc to pass in numerical indices of the element even if the index labels are present.
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [54]:
df.loc['A','Y']

0.9079694464765431

In [55]:
df.loc[['A','B']]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
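Both loc and iloc can select rows and columns at the same time. The following sketch uses a stand-in frame named demo filled with consecutive integers, so the values differ from the random df above. It also shows one difference worth knowing: label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import numpy as np
import pandas as pd

# Stand-in frame for illustration only
demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                    index=list('ABCDE'), columns=list('WXYZ'))

# Rows A and B, columns W and Y, selected by label
by_label = demo.loc[['A', 'B'], ['W', 'Y']]

# The same subset selected by integer position
by_position = demo.iloc[0:2, [0, 2]]

# Label slices include the end label; position slices exclude the end
label_slice = demo.loc['A':'C']      # rows A, B, C
position_slice = demo.iloc[0:2]      # rows A, B

print(by_label)
```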


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [56]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [57]:
# Let us check which elements in this dataframe are greater than 0
df > 0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [58]:
# Just like numpy arrays, we can also use these boolean values to select the elements that hold true in the dataframe.
bool_df = df>0

In [59]:
df[bool_df]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [60]:
# We can even perform these conditional operations for specific columns, rows.
df['W'] > 0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [61]:
# We can include these conditions to return only the rows where the condition holds true
df[df['W'] > 0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [62]:
# We can even stack multiple commands in this manner
# Let us say we want to fetch the values in column X where W > 0 
df[df['W']>0]['X']

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64
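Conditions can also be combined. The sketch below rebuilds three of the columns from the rounded values printed above and combines two conditions with & (and) and | (or); each condition needs its own parentheses, and the Python keywords and/or do not work here. It also shows df.loc[condition, column] as a single-step alternative to chaining two sets of brackets.

```python
import pandas as pd

# Rounded values copied from the DataFrame shown above
df = pd.DataFrame({'W': [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
                   'X': [0.628133, -0.319318, 0.740122, -0.758872, 1.978757],
                   'Y': [0.907969, -0.848077, 0.528813, -0.933237, 2.605967]},
                  index=list('ABCDE'))

# Rows where both conditions hold
both = df[(df['W'] > 0) & (df['Y'] > 1)]

# Rows where at least one condition holds
either = df[(df['W'] > 0) | (df['Y'] > 1)]

# Column X for the rows where W > 0, selected in one step
x_where_w_pos = df.loc[df['W'] > 0, 'X']

print(both)
print(either)
```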

## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [63]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [64]:
# Let us now reset the index to the default 0,1,2
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


Here we can see that the old index is moved into a separate column named index.

In [65]:
# Now let us try to set a new index. Consider this list
newind = 'CA NY WY OR CO'.split()
newind

['CA', 'NY', 'WY', 'OR', 'CO']

In [66]:
df['States']= newind

In [67]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [68]:
df.set_index('States')

States,W,X,Y,Z
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


Note that this does not happen in place. When we look into the original dataset:

In [69]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [70]:
# We must specify an argument 'inplace = True'
df.set_index('States',inplace = True)
df

States,W,X,Y,Z
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509
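Instead of inplace = True, the result of set_index can be assigned to a variable. The sketch below uses a two-row stand-in frame named demo and also shows that set_index discards the previous index, so a later reset_index does not bring the old labels back.

```python
import pandas as pd

# Stand-in frame for illustration only
demo = pd.DataFrame({'W': [2.7, 0.65], 'States': ['CA', 'NY']},
                    index=['A', 'B'])

# Assigning the result keeps demo untouched and gives a re-indexed copy
by_state = demo.set_index('States')

# reset_index turns States back into a column; the old labels A, B are gone
restored = by_state.reset_index()

print(by_state)
print(restored)
```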


## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [108]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [72]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


Now let's show how to index this! 

For index hierarchy we use df.loc[]. If this were on the columns axis, you would just use normal bracket notation df[].

Calling one level of the index returns the sub-dataframe:

In [73]:
df.loc['G1']

Unnamed: 0,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [74]:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64

In [77]:
# Now if we had to fetch the 2nd row in group G2 of column B, we can use the code
df.loc['G2'].loc[2]['B']
# Note that here, we must work our way from the outside to the inside

0.07295967531703869
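The same lookup can be written in one step by passing a tuple of index levels to loc. The sketch below rebuilds the index with consecutive integers as data, so the values differ from the random ones above, and gives the levels the names Group and Num. Those names are an assumption made for this example; they let xs take a cross-section on the inner level.

```python
import numpy as np
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
# The level names 'Group' and 'Num' are hypothetical, chosen for this sketch
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)),
                                       names=['Group', 'Num'])
demo = pd.DataFrame(np.arange(12).reshape(6, 2),
                    index=hier_index, columns=['A', 'B'])

# One-step equivalent of demo.loc['G2'].loc[2]['B']
value = demo.loc[('G2', 2), 'B']

# Cross-section: the rows with Num == 1 from every group
first_of_each = demo.xs(1, level='Num')

print(value)
print(first_of_each)
```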

# Missing Data

Missing data is a common problem in large datasets, and it must be handled correctly to get accurate results. Let's show a few convenient methods to deal with missing data in pandas:

In [78]:
# Let us create a dataframe as follows
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [79]:
# A lot of times, you will have to drop the missing values to deal with the missing data
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


By default, dropna works along axis = 0 and drops every row that contains a missing value. To drop the columns that contain missing values instead, we can specify axis = 1:

In [80]:
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


When we check the documentation of the dropna method, we can see that there is a thresh argument. A row is kept only if it has at least thresh non-NaN values.

In [81]:
# Let us go ahead and set the threshold to 2
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2


Because row 1 has at least 2 non-NaN values, it was retained. Row 2 has only one non-NaN value, so it was dropped.

Let us now see how to fill in missing values using the method fillna().

In [82]:
df.fillna(value = 'Fill')

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,Fill,2
2,Fill,Fill,3


In [83]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [84]:
# We can also perform specific operations such as filling the missing values with the mean of the existing values.
df['A'].fillna(df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
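The same idea extends to the whole DataFrame: passing df.mean() to fillna fills each column with its own mean. A short sketch with the same data follows; the name filled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# df.mean() is a Series of column means; fillna matches it to columns by name
filled = df.fillna(df.mean())
print(filled)
```

Like dropna, fillna returns a new object here, so df itself still contains the missing values.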

# Groupby

The groupby method allows you to group rows of data together and call aggregate functions on each group.

In [109]:
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


**Now you can use the .groupby() method to group rows together based off of a column name. For instance let's group based off of Company. This will create a DataFrameGroupBy object:**

In [86]:
df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f89780da3a0>

You can save this as a new variable.

In [87]:
df_comp = df.groupby("Company")

Now we can call aggregate methods of the object

In [88]:
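df_comp.mean(numeric_only=True)
# numeric_only=True restricts the mean to the numeric columns. Older pandas versions skipped string columns
# automatically, but pandas 2.0 and later raise an error on the Person column without it.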

Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0


In [89]:
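# Without numeric_only=True, pandas 2.0 and later would also concatenate the strings in Person
df_comp.sum(numeric_only=True)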

Company,Sales
FB,593
GOOG,320
MSFT,464


In [90]:
df_comp.std(numeric_only=True)

Company,Sales
FB,75.660426
GOOG,56.568542
MSFT,152.735065


Notice that it uses Company as the index and Sales as the values.

In [91]:
df_comp.count()

Company,Person,Sales
FB,2,2
GOOG,2,2
MSFT,2,2


Note that in this case count also returns the Person column, because counting non-null entries works for strings too.

In [92]:
df_comp.max()

Company,Person,Sales
FB,Sarah,350
GOOG,Sam,200
MSFT,Vanessa,340


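Here we also receive a person name because Python can compare strings alphabetically. Note that the maximum of each column is computed independently: for MSFT, Vanessa is the alphabetically last name, but the sale of 340 belongs to Amy.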

In [93]:
# The describe function returns summary statistics for each group
df_comp.describe()

,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Company,count,mean,std,min,25%,50%,75%,max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


In [94]:
df_comp.describe().transpose()

Unnamed: 0,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0
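When only a few statistics are needed, agg can compute several of them for a single column in one call. The sketch below uses the same company data and also uses idxmax to fetch the complete row of the top sale in each company, which keeps the person and the sales figure together. The names summary and top_rows are illustrative.

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Several aggregates of the Sales column per company
summary = df.groupby('Company')['Sales'].agg(['mean', 'max', 'count'])

# idxmax gives the row label of the largest sale in each group;
# loc then returns those full rows, so Person and Sales belong together
top_rows = df.loc[df.groupby('Company')['Sales'].idxmax()]

print(summary)
print(top_rows)
```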


# Merging, Joining, and Concatenating

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. 


### Example DataFrames

In [95]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [96]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [97]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


## Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [98]:
# We pass a list of DataFrames to concatenate
pd.concat([df1,df2,df3])
# By default, the axis is 0

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [99]:
# When concatenated by columns, we get
pd.concat([df1,df2,df3],axis=1)

Unnamed: 0,A,B,C,D,A,B,C,D,A,B,C,D
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
10,,,,,,,,,A10,B10,C10,D10
11,,,,,,,,,A11,B11,C11,D11


* Similarly, we use the merge function to combine DataFrames on shared key columns, like joining tables in SQL.
* Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

Since these methods are specific to certain contexts, we will discuss them in detail when we come across them.
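As a brief preview, here is a minimal sketch of both. The frames left, right, left_idx, and right_idx and the column name key are made up for this example. merge matches rows on a column, while join matches rows on the index.

```python
import pandas as pd

# Hypothetical example frames sharing a 'key' column
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K4'],
                      'C': ['C0', 'C1', 'C2', 'C4'],
                      'D': ['D0', 'D1', 'D2', 'D4']})

# Inner merge keeps only the keys present in both frames (K0, K1, K2)
merged = pd.merge(left, right, how='inner', on='key')

# Hypothetical frames with different but overlapping indexes
left_idx = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right_idx = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

# join is a left join on the index by default; K1 has no match, so C is NaN
joined = left_idx.join(right_idx)

# how='outer' keeps the labels from both indexes
joined_outer = left_idx.join(right_idx, how='outer')

print(merged)
print(joined)
```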

# Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category.

In [101]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [102]:
# The head function returns the first n rows of the dataframe. By default, n is set to 5 rows
df.head()

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


#### Information on unique values of a column

In [103]:
# Lists the unique values
df['col2'].unique()

array([444, 555, 666])

In [104]:
# Counts the number of unique values
df['col2'].nunique()

3

In [105]:
# Counts how many times each unique value occurs
df['col2'].value_counts()

444    2
666    1
555    1
Name: col2, dtype: int64

In [106]:
# Returns the sorted values of the column
df['col2'].sort_values()

0    444
3    444
1    555
2    666
Name: col2, dtype: int64

In [107]:
# Checks if there are null values
df.isnull()

Unnamed: 0,col1,col2,col3
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
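On larger DataFrames the full True/False table is hard to read, so it is common to summarize it. A small sketch with the same data follows; the names null_counts and any_nulls are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# True counts as 1 when summed, so this gives the number of nulls per column
null_counts = df.isnull().sum()

# A single flag for the whole DataFrame
any_nulls = df.isnull().values.any()

print(null_counts)
print(any_nulls)
```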


# Data Input and Output

Data for analysis can come from
* CSV files
* Excel sheets
* HTML
* SQL (Not necessary for this course)


**Note: If the data file is in the same folder as the Python notebook, you can reference it by its name alone. If the file is outside the folder containing the notebook, include the path to the file.**

## CSV
CSV (comma-separated values) files can be used for both input and output.

To use CSV files as input:

**var = pd.read_csv('file_name.csv')**

____________________________________________________

To write a DataFrame out to a CSV file (to_csv is a DataFrame method, so it is called on the DataFrame itself, here df):

**df.to_csv('file_name.csv', index = False)**

______________________________________________________________
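A quick way to check the CSV round trip is to write a small DataFrame to a temporary file and read it back. This is only a sketch: the frame, the file name example.csv, and the variable names are made up for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')

    # index=False leaves the row labels out of the file
    df.to_csv(path, index=False)

    # Reading the file back gives a DataFrame with a fresh 0, 1, 2 index
    restored = pd.read_csv(path)

print(restored)
```

Without index=False, the row labels would be written as an extra first column and would come back as a column named "Unnamed: 0" when the file is read.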

## Excel
Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images. Having images or macros in the workbook may cause the read_excel method to fail.

To use excel files as input:

**pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')**
______________________________________________________________

To write a DataFrame (here df) out to an Excel file:

**df.to_excel('Excel_Sample.xlsx', sheet_name='Sheet1')**
______________________________________________________________

## HTML Input

The pandas read_html function reads the tables on a webpage and returns a list of DataFrame objects, one per table.

To read input from an HTML page:

**tables = pd.read_html('webpage_address.html')**
____________________________________________________________________________________________

# Data Visualization
We will be using two main libraries for data visualization
* Matplotlib
* Seaborn

## Matplotlib
Matplotlib is the "grandfather" library of data visualization with Python. It was created to try to replicate MATLAB's plotting capabilities in Python, so if you happen to be familiar with MATLAB, matplotlib will feel natural to you.

You can check out different plotting styles and examples on their official documentation: https://matplotlib.org/stable/gallery/index

To install matplotlib, you will need to execute one of these commands on your terminal:

**conda install matplotlib**

or

**pip install matplotlib**

## Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

You can check out more about the plotting styles in the official documentation: https://seaborn.pydata.org/examples/index.html