<img src="AV_Logo.png" style="width: 200px;height: 75px"/>

## Table of Contents
* [Why learn Python for data analysis?](#Why-learn-Python-for-data-analysis?)
* [Python Data Structures](#Python-Data-Structures)
* [Conditional and iterative statements](#Conditional-and-iterative-statements)
* [Loading data](#Loading-data)
* [Understand pandas dataframes](#Understand-pandas-dataframes)

### Why learn Python for data analysis?

Python has gained a lot of interest recently as a main language for data analysis. Compared with SAS and R, here are some reasons in favour of learning Python:

* Open Source – free to install
* Awesome online community
* Easy to learn
* Potential to become the common language for data science and for building web-based analytics products.

Needless to say, it has a few drawbacks too:

* It is an interpreted language rather than a compiled language, so it may take more CPU time. However, given the savings in programmer time due to ease of learning, it can still be a good choice.
* Python is still rarely used for client-side software, such as mobile apps. Faster and more efficient languages are preferred there.

Let's do a simple addition in a Jupyter notebook.

*Note: the "#" character in Python starts a comment. The rest of the line is for programmers to read and is not executed.*

In [1]:
# Add 2 & 2 and assign it to "addition" variable
addition = 2 + 2 

In [2]:
print(addition) # print addition

4


In a Jupyter notebook, you can display a variable's value by typing its name as the last line of a cell. For example:

In [3]:
addition

4

Note: You should still use the print command when you want explicit control over what is printed.

In [4]:
# multiply 4 & 4 and assign it to "multiplication"
multiplication = 4 * 4

In [5]:
print(multiplication) # print multiplication

16


**Exercise**

Q1. Add the two numbers 3 and 4, then assign the result to the "answer" variable.

In [6]:
answer = 3 + 4

Q2. Divide 6 by 3, then print the result.

In [7]:
divide = 6/3
print(divide)

2.0
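
The result above is `2.0` rather than `2` because `/` always performs true division in Python 3. The following is a minimal sketch contrasting it with floor division and the remainder operator; the variable names are chosen here for illustration only.

```python
# True division always returns a float, even when the result is a whole number
true_div = 6 / 3

# Floor division discards the fractional part and returns an int for int inputs
floor_div = 7 // 2

# The modulo operator returns the remainder of the division
remainder = 7 % 2

print(true_div)   # 2.0
print(floor_div)  # 3
print(remainder)  # 1
```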


### Python Data Structures

The following are some data structures used in Python. You should be familiar with them in order to use them appropriately.

#### Lists
Lists are one of the most versatile data structures in Python. A list can be defined by writing comma-separated values in square brackets. A list may contain items of different types, but usually the items all have the same type. Python lists are mutable, so individual elements of a list can be changed.

Here is a quick example of how to define a list and then access it.

In [8]:
square_list = [0, 1, 4, 9, 16, 25] # define a list

In [9]:
print(square_list) # print square_list

[0, 1, 4, 9, 16, 25]


Individual elements of a list can be accessed by writing the index number in square brackets. Note that list indices start at 0, not 1.

In [10]:
print(square_list[0]) # print first element of list

0


A range of elements can be accessed by giving a start index and an end index separated by a colon (a slice).

In [11]:
print(square_list[2:4]) # slice square_list. 

[4, 9]


You can see here that the start index is included whereas the end index is excluded.

--------

A negative index accesses the list from the end.

In [12]:
print(square_list[-2]) # print second last element in the list

16
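
Since lists are mutable, elements can be replaced or added after the list is created. The sketch below illustrates this on the same `square_list`; `last_three` is a name introduced here for illustration.

```python
square_list = [0, 1, 4, 9, 16, 25]

square_list[0] = -1             # replace the first element in place
square_list.append(36)          # add a new element at the end
last_three = square_list[-3:]   # slice from the third-last element to the end

print(square_list)  # [-1, 1, 4, 9, 16, 25, 36]
print(last_three)   # [16, 25, 36]
```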


#### Strings 
Strings can be defined using single ( ' ), double ( " ) or triple ( ''' ) quotes. Strings enclosed in triple quotes ( ''' ) can span multiple lines and are frequently used in docstrings (Python’s way of documenting functions). The backslash (`\`) is used as an escape character. Note that Python strings are immutable, so you cannot change part of a string.

Here is a quick example of defining a string and working with it:

In [13]:
greeting = "Hello"              # assign a string 
print (greeting[1])             # return character at index 1
print (len(greeting))           # print length of string
print (greeting + "World")      # string concatenation

e
5
HelloWorld


Raw strings pass the string through as it is: the Python interpreter does not process escape sequences in a string marked as raw. A raw string is defined by adding "r" before the opening quote.

In [14]:
stmt = r'\n is a newline character by default'
print (stmt)

\n is a newline character by default


Python strings are immutable. This means a string can't be changed in place; any attempt to modify part of it will result in an error.

In [15]:
greeting[1] = 'i'

TypeError: 'str' object does not support item assignment
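
Because a string cannot be modified in place, the usual approach is to build a new string instead. The following is a small sketch of two common ways to do that; `new_greeting` and `replaced` are names introduced here for illustration.

```python
greeting = "Hello"

# Build a new string from slices of the old one plus the new character
new_greeting = greeting[:1] + "i" + greeting[2:]

# str.replace also returns a new string and leaves the original untouched
replaced = greeting.replace("l", "L")

print(new_greeting)  # Hillo
print(replaced)      # HeLLo
print(greeting)      # Hello (the original is unchanged)
```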


#### Dictionary
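A dictionary is a collection of key: value pairs, with the requirement that the keys are unique (within one dictionary). Since Python 3.7 a dictionary preserves insertion order; the outputs below appear in a different order, which suggests they were produced on an older version where the order was arbitrary. A pair of braces creates an empty dictionary: { }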


In [16]:
d = {'One': 1}

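Here, "One" is the key and 1 is the value stored in dictionary "d".

A dictionary can hold several key: value pairs. A new pair is added, or an existing one updated, by assigning a value to its key, as the cells below show: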

In [17]:
extensions = {'Kunal': 9073, 'Tavish': 9128, 'Sunil': 9223, 'Nitin': 9330}
print(extensions)

{'Sunil': 9223, 'Kunal': 9073, 'Tavish': 9128, 'Nitin': 9330}


In [18]:
extensions['Mukesh'] = 9410

In [19]:
print ('Before: ', extensions['Mukesh'])
extensions['Mukesh'] = 9150
print ('After: ', extensions['Mukesh'])

Before:  9410
After:  9150


In [20]:
print(extensions.keys())

dict_keys(['Sunil', 'Kunal', 'Tavish', 'Nitin', 'Mukesh'])


In [21]:
print(extensions.values())

dict_values([9223, 9073, 9128, 9330, 9150])
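
Beyond `keys()` and `values()`, a few other dictionary operations come up often. The sketch below reuses the `extensions` dictionary; the key `'Ankit'` is a hypothetical name that is deliberately absent, used only to show the lookup of a missing key.

```python
extensions = {'Kunal': 9073, 'Tavish': 9128, 'Sunil': 9223, 'Nitin': 9330}

# Membership test checks the keys, not the values
has_kunal = 'Kunal' in extensions

# get() returns a default instead of raising KeyError for a missing key
missing = extensions.get('Ankit', 'not found')

# items() yields (key, value) pairs for iteration
for name, number in extensions.items():
    print(name, number)

# pop() removes a key and returns its value
removed = extensions.pop('Nitin')

print(has_kunal)  # True
print(missing)    # not found
print(removed)    # 9330
```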


**Exercise**

Q1. Make a list of the names "Alpha", "Beta", "Gamma", "Theta" & "Omega". Assign it to the "names" variable, then print the fourth name from the beginning.

In [22]:
names = ["Alpha", "Beta", "Gamma", "Theta", "Omega"]
print(names[3])  # index 3 is the fourth element

Theta


Q2. Create a dictionary with the keys, "Alpha", "Beta", "Gamma", "Theta" & "Omega" along with values in ascending order starting from 1; viz, 

    Alpha -> 1
    Beta -> 2
    Gamma -> 3
and so on

In [23]:
viz = {"Alpha":1, "Beta":2, "Gamma":3, "Theta":4, "Omega":5}
print(viz)

{'Omega': 5, 'Beta': 2, 'Gamma': 3, 'Alpha': 1, 'Theta': 4}


### Conditional and iterative statements

Conditional statements are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:

    if [condition]:
      __execution if true__
    else:
      __execution if false__
      
You can see that there is an indent (leading spaces) before the "__execution if true__" and "__execution if false__" statements. Indentation is mandatory in Python because it defines which statements belong to a block. If you don't indent correctly, Python raises an error.

In [24]:
if 2%3 == 1:
print("yes")

IndentationError: expected an indented block (<ipython-input-24-98296dc2ad98>, line 2)

As you can see, the Python interpreter reports an indentation error.

------------

For instance, if we want to print whether the number N is even or odd:

    if N%2 == 0:
      print('Even')
    else:
      print('Odd')
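
The snippet above assumes that `N` is already defined. A runnable sketch of the same check is shown here, wrapped in a hypothetical helper function `parity` (not part of the original notebook) so that it can be tried on several values.

```python
def parity(n):
    """Return 'Even' if n is divisible by 2, otherwise 'Odd'."""
    if n % 2 == 0:
        return 'Even'
    else:
        return 'Odd'

# Try the check on a couple of example values
for N in [4, 7]:
    print(N, parity(N))
```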

Like most languages, Python also has a for loop, which is the most widely used method for iteration. It has a simple syntax:

    for i in [Python Iterable]:
      expression(i)

Here “Python Iterable” can be a list, tuple or other advanced data structure. Let’s take a look at a simple example that prints the numbers from 1 to 9.

In [25]:
for i in range(1, 10):
    print(i)

1
2
3
4
5
6
7
8
9


As you can see above, 10 is not printed because "range" excludes its end value.
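
The same loop pattern can accumulate a result instead of printing each value. As a minimal sketch, the factorial of a number can be computed this way; the input `n = 5` is an example value chosen here.

```python
n = 5  # example input, chosen for illustration

factorial = 1
# range(1, n + 1) yields 1, 2, ..., n because the end value is excluded
for i in range(1, n + 1):
    factorial = factorial * i

print(factorial)  # 120
```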

**Exercise**

Q1. Print even numbers from the given list - "number_list". 

In [26]:
number_list = [0, 2, 4, 5, 7, 8]
evens=[x for x in number_list if x % 2 == 0]
print(evens)

[0, 2, 4, 8]


In [27]:
import numpy as np

In [28]:
sample_list = [1, 2, 3, 4, 5, 6]  # a sample list; the name avoids shadowing the built-in "list"
evens = [x for x in sample_list if np.mod(x, 2) == 0]  # same filter, using NumPy's mod
print(evens)

[2, 4, 6]


Let's go one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

In [29]:
import math as m

from math import *

In the first form, we have defined an alias 'm' for the math library. We can now use functions from the math library (e.g. factorial) by referencing them through the alias, as in 'm.factorial()'.

In the second form, you have imported the entire namespace of math, i.e. you can use factorial() directly without referring to math.
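
A short sketch of both styles in use follows. Importing one specific name, as done here with `factorial`, is a common alternative to `from math import *` because it keeps the namespace smaller; that choice is a suggestion made here, not something the original notebook prescribes.

```python
import math as m            # alias style: names are reached through "m."
from math import factorial  # direct style: only the named function is imported

print(m.factorial(5))  # 120
print(factorial(5))    # 120
print(m.sqrt(16))      # 4.0
```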

The following libraries are needed for most scientific computing and data analysis in Python:

* **NumPy** stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
* **SciPy** stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, linear algebra, optimization and sparse matrices.
* **Matplotlib** for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in the IPython notebook (`ipython notebook --pylab=inline`) to use these plotting features inline. If you omit the inline option, then pylab converts the IPython environment into an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
* **Pandas** for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to the Python ecosystem and has been instrumental in boosting Python’s usage in the data science community.
* **Scikit Learn** for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.

### Loading data

Finally for today, let's take a look at how to load a dataset in Python. Data can be stored in many kinds of sources. In this session, we will load a dataset stored in CSV format. For other formats, you can refer to [this article](https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/).

In [30]:
import pandas as pd  # import pandas




In [31]:
data = pd.read_csv('data.csv') # read file

In [32]:
# see only the first five rows
data.head(5)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDW58 | 20.75 | Low Fat | 0.007565 | Snack Foods | 107.8622 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 1 | FDW14 | 8.3 | reg | 0.038428 | Dairy | 87.3198 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 2 | NCN55 | 14.6 | Low Fat | 0.099575 | Others | 241.7538 | OUT010 | 1998 | | Tier 3 | Grocery Store |
| 3 | FDQ58 | 7.315 | Low Fat | 0.015388 | Snack Foods | 155.034 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 4 | FDY38 | | Regular | 0.118599 | Dairy | 234.23 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 |


### Understand pandas dataframes

A DataFrame in pandas is a tabular data structure made up of rows and columns, similar to a spreadsheet or a database table. We will now take a look at how to work with a dataframe.

In [33]:
# To access a column
data['Item_Identifier']

0      FDW58
1      FDW14
2      NCN55
3      FDQ58
4      FDY38
5      FDH56
6      FDL48
7      FDC48
8      FDN33
9      FDA36
10     FDT44
11     FDQ56
12     NCC54
13     FDU11
14     DRL59
15     FDM24
16     FDI57
17     DRC12
18     NCM42
19     FDA46
20     FDA31
21     NCJ31
22     FDG52
23     NCL19
24     FDS10
25     FDX22
26     NCF19
27     NCE06
28     DRC27
29     FDE21
       ...  
247    NCC55
248    FDH52
249    FDO09
250    FDU43
251    FDX09
252    FDH40
253    FDW58
254    FDC47
255    NCR41
256    DRG49
257    FDC52
258    FDY13
259    FDY35
260    FDI35
261    NCU18
262    FDY33
263    FDC15
264    FDU01
265    NCU42
266    NCV29
267    NCY30
268    FDQ46
269    FDG58
270    FDK33
271    NCD07
272    NCX29
273    NCT42
274    DRN59
275    FDI44
276    FDA45
Name: Item_Identifier, dtype: object

In [34]:
# To access multiple columns
data[['Item_Identifier', 'Item_Weight']]

| | Item_Identifier | Item_Weight |
|---|---|---|
| 0 | FDW58 | 20.750 |
| 1 | FDW14 | 8.300 |
| 2 | NCN55 | 14.600 |
| 3 | FDQ58 | 7.315 |
| 4 | FDY38 | |
| 5 | FDH56 | 9.800 |
| 6 | FDL48 | 19.350 |
| 7 | FDC48 | |
| 8 | FDN33 | 6.305 |
| 9 | FDA36 | 5.985 |


In [35]:
# To access a row
data.loc[0]

Item_Identifier                          FDW58
Item_Weight                              20.75
Item_Fat_Content                       Low Fat
Item_Visibility                     0.00756484
Item_Type                          Snack Foods
Item_MRP                               107.862
Outlet_Identifier                       OUT049
Outlet_Establishment_Year                 1999
Outlet_Size                             Medium
Outlet_Location_Type                    Tier 1
Outlet_Type                  Supermarket Type1
Name: 0, dtype: object

In [36]:
# To access multiple rows
data.loc[0:5]

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDW58 | 20.75 | Low Fat | 0.007565 | Snack Foods | 107.8622 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 1 | FDW14 | 8.3 | reg | 0.038428 | Dairy | 87.3198 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 2 | NCN55 | 14.6 | Low Fat | 0.099575 | Others | 241.7538 | OUT010 | 1998 | | Tier 3 | Grocery Store |
| 3 | FDQ58 | 7.315 | Low Fat | 0.015388 | Snack Foods | 155.034 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 4 | FDY38 | | Regular | 0.118599 | Dairy | 234.23 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 |
| 5 | FDH56 | 9.8 | Regular | 0.063817 | Fruits and Vegetables | 117.1492 | OUT046 | 1997 | Small | Tier 1 | Supermarket Type1 |


In [37]:
# to access a specific row and a specific column by position
# for example, to extract the 2nd row's value in the 3rd column
# .iloc selects by integer position (.ix was removed in pandas 1.0)
data.iloc[1, 2]

'reg'
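
Rows and columns can be addressed either by label (`loc`) or by integer position (`iloc`), and the two slice differently. The sketch below uses a small hypothetical frame `df`, with values copied from the first three rows shown above, rather than the full dataset.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset
df = pd.DataFrame(
    {'Item_Identifier': ['FDW58', 'FDW14', 'NCN55'],
     'Item_Weight': [20.75, 8.3, 14.6],
     'Item_Fat_Content': ['Low Fat', 'reg', 'Low Fat']}
)

by_position = df.iloc[1, 2]                # 2nd row, 3rd column, by position
by_label = df.loc[1, 'Item_Fat_Content']   # row label 1, column name

label_slice = df.loc[0:1]      # label-based slice: both ends are included
position_slice = df.iloc[0:1]  # position-based slice: the end is excluded

print(by_position, by_label)                  # reg reg
print(len(label_slice), len(position_slice))  # 2 1
```

This is also why `data.loc[0:5]` above returns six rows rather than five.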

In [38]:
# to select only the rows where 'Item_Type' is 'Dairy', filter with a boolean condition
data[data.Item_Type == 'Dairy']

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FDW14 | 8.3 | reg | 0.038428 | Dairy | 87.3198 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 4 | FDY38 | | Regular | 0.118599 | Dairy | 234.23 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 |
| 28 | DRC27 | 13.8 | Low Fat | 0.058102 | Dairy | 244.6802 | OUT046 | 1997 | Small | Tier 1 | Supermarket Type1 |
| 46 | FDR14 | 11.65 | Low Fat | 0.291322 | Dairy | 55.8298 | OUT010 | 1998 | | Tier 3 | Grocery Store |
| 59 | FDE52 | 10.395 | Regular | 0.029947 | Dairy | 90.1514 | OUT045 | 2002 | | Tier 2 | Supermarket Type1 |
| 61 | FDL51 | 20.7 | Regular | 0.04776 | Dairy | 214.9876 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 72 | FDZ14 | 7.71 | Regular | 0.047783 | Dairy | 122.3756 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 |
| 79 | FDA14 | 16.1 | LF | 0.065129 | Dairy | 145.176 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 |
| 84 | FDY27 | 6.38 | Low Fat | 0.032079 | Dairy | 177.6344 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 |
| 89 | FDB03 | 17.75 | Regular | 0.262504 | Dairy | 242.2538 | OUT010 | 1998 | | Tier 3 | Grocery Store |
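
The filter above works because the comparison produces a Series of True/False values (a boolean mask), and indexing with that mask keeps only the True rows. The sketch below shows the intermediate mask and how two conditions can be combined; `df` is a small hypothetical frame and the price threshold of 100 is an arbitrary example value.

```python
import pandas as pd

# Hypothetical four-row frame standing in for the real dataset
df = pd.DataFrame(
    {'Item_Type': ['Snack Foods', 'Dairy', 'Others', 'Dairy'],
     'Item_MRP': [107.8622, 87.3198, 241.7538, 234.23]}
)

is_dairy = df['Item_Type'] == 'Dairy'  # boolean mask, one value per row
dairy_rows = df[is_dairy]              # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
expensive_dairy = df[is_dairy & (df['Item_MRP'] > 100)]

print(is_dairy.tolist())               # [False, True, False, True]
print(dairy_rows.index.tolist())       # [1, 3]
print(expensive_dairy.index.tolist())  # [3]
```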


**Exercise**

Q1. Load "data.txt" file and print first 10 rows
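
One possible approach is sketched below. It assumes that "data.txt" is a comma-separated text file, which the exercise does not state; if the file uses another delimiter, `sep` would need to change (for example `sep="\t"` for tabs). So that the sketch runs on its own, it first writes a tiny hypothetical stand-in file to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Write a small hypothetical stand-in for "data.txt" (assumed comma-separated)
sample_text = "Item_Identifier,Item_Weight\nFDW58,20.75\nFDW14,8.3\nNCN55,14.6\n"
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")
with open(path, "w") as f:
    f.write(sample_text)

# read_csv is not tied to the .csv extension; sep names the delimiter
data_txt = pd.read_csv(path, sep=",")

# head(10) returns at most the first 10 rows
first_rows = data_txt.head(10)
print(first_rows)
```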

That's all for today!
----------------------------

-------------------------------
<img src="AV_Datafest_logo.png" style="width: 200px;height: 200px"/>
[www.analyticsvidhya.com](www.analyticsvidhya.com)

DATAFEST 2017