# Introduction
Welcome to this module on Python for Data Science.

  

In this module
In this module, we will cover NumPy and Pandas, two of the most widely used Python libraries for data science.

 

This module will teach you the basics of NumPy, which is the fundamental package for scientific computing in Python. NumPy is built around a powerful data structure: the multidimensional array. Pandas is another powerful Python library that provides a fast and easy-to-use platform for data analysis. You will also learn about 'Data Visualisation in Python' in a separate module.

 

### Pre-requisite for this Module
You are expected to have completed the "Intro to Python" module before beginning this module.

 

In this session
- Understand the advantages of vectorized code using NumPy (over standard Python ways)
- Create NumPy arrays
- Convert lists and tuples to NumPy arrays
- Create (initialise) arrays
- Inspect the structure and content of arrays
- Subset, slice, index and iterate through arrays
- Compare computation times in NumPy and standard Python lists
 

Guidelines for coding console questions
The lectures are interspersed with coding consoles to help you practise writing Python code. You will be given a brief problem statement and some pre-written code. You can write the code in the provided space, verify your answer using test cases, and submit when you are confident about the answer.

 

Note that the coding console questions are non-graded. Some instructions for these questions are as follows:

- Ignore the pre-written code on the console. Please don't change it.
- Write your answer where you're asked to write it.
- You may run and verify your code any number of times.



#  NumPy Basics
NumPy is a library written for scientific computing and data analysis. The name stands for Numerical Python.

 

The most basic object in NumPy is the ndarray (or simply an array), which is an n-dimensional, homogeneous array. By homogeneous, we mean that all the elements in a NumPy array have to be of the same data type, which is commonly numeric (float or integer).

 

You can download the Python notebooks used in the lecture from the link below. It is recommended that you keep executing the commands on your computer at pace with the lecture.



# Creating NumPy Arrays
There are multiple ways to create NumPy arrays, the most common ones being:

- Convert lists or tuples to arrays using np.array()
- Initialise arrays of fixed size (when the size is known)

You initialise arrays when you know the size of the array beforehand.

The following ways are commonly used:

- np.ones(): Create array of 1s
- np.zeros(): Create array of 0s
- np.random.random(): Create array of random numbers
- np.arange(): Create array with increments of a fixed step size
- np.linspace(): Create array of a fixed length with evenly spaced values between two endpoints
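As a minimal sketch of the functions listed above (the variable names and the sizes chosen are only illustrative):

```python
import numpy as np

# Convert a list and a tuple to arrays
from_list = np.array([1, 2, 3])
from_tuple = np.array((4, 5, 6))

# Initialise arrays of a known size
ones = np.ones((2, 3))            # 2 x 3 array of 1s (float by default)
zeros = np.zeros(4, dtype=int)    # 1-D array of four integer 0s
rand = np.random.random((2, 2))   # 2 x 2 array of random floats in [0, 1)
steps = np.arange(10, 20, 5)      # start at 10, stop before 20, step of 5
points = np.linspace(0, 1, 5)     # 5 evenly spaced values from 0 to 1, both included

print(steps)
print(points)
```

Note that np.arange() takes a step size and leaves out the stop value, while np.linspace() takes the number of values and includes the stop value by default.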

# Structure and Content of Arrays
It is helpful to inspect the structure of NumPy arrays, especially while working with large arrays. Some attributes of NumPy arrays are:

- shape: Shape of array (n x m)
- dtype: data type (int, float etc.)
- ndim: Number of dimensions (or axes)
- itemsize: Memory used by each array element in bytes
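The attributes above can be inspected as follows. This is a small sketch on a hypothetical 2 x 3 array of floats:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

print(arr.shape)     # (2, 3): 2 rows, 3 columns
print(arr.dtype)     # float64
print(arr.ndim)      # 2
print(arr.itemsize)  # 8 (bytes per float64 element)
```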


# Subset, Slice, Index and Iterate through Arrays
Let's now look at how to access the elements of an array. For one-dimensional arrays, indexing, slicing etc. is similar to python lists - indexing starts at 0.
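A short sketch of indexing, slicing and iterating on a one-dimensional array (the array contents are only illustrative):

```python
import numpy as np

arr = np.arange(10, 20)      # 10, 11, ..., 19

first = arr[0]               # 10
last = arr[-1]               # 19
middle = arr[2:5]            # elements at positions 2, 3 and 4
every_other = arr[::2]       # every second element
subset = arr[[1, 3, 5]]      # pick positions using a list of indices

for value in arr[:3]:        # iterate just like a list
    print(value)
```

Selecting with a list of indices, as in the last assignment, works on NumPy arrays but not on plain Python lists.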



# Multidimensional Arrays
Multidimensional arrays are indexed using as many indices as the number of dimensions or axes. For instance, to index a 2-D array, you need two indices - array[x, y].

 

Each axis has an index starting at 0. The following figure shows the axes and their indices for a 2-D array.
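As a sketch of the array[x, y] form on a hypothetical 3 x 3 array, where the first index selects the row and the second selects the column:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

cell = grid[1, 2]         # row 1, column 2
row = grid[0, :]          # the whole first row
col = grid[:, 1]          # the whole second column
block = grid[:2, 1:]      # rows 0 and 1, columns 1 and 2

print(cell)
print(block)
```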

# Computation Times in NumPy and Standard Python Lists
We mentioned that the key advantages of NumPy are convenience and speed of computation.

 

You'll often work with extremely large datasets, so it is important to understand how much computation time (and memory) you can save by using NumPy instead of standard Python lists.

 

Let's compare the computation times of arrays and lists for a simple task of calculating the element-wise product of numbers.
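One way to run such a comparison is sketched below. The array size of one million and the use of time.perf_counter() are assumptions made for illustration; the timings you see will depend on your machine.

```python
import time
import numpy as np

n = 1_000_000
list_a, list_b = list(range(n)), list(range(n))
arr_a = np.array(list_a, dtype=np.int64)
arr_b = np.array(list_b, dtype=np.int64)

# Element-wise product using plain Python lists
start = time.perf_counter()
list_product = [x * y for x, y in zip(list_a, list_b)]
list_seconds = time.perf_counter() - start

# Element-wise product using NumPy arrays
start = time.perf_counter()
arr_product = arr_a * arr_b
array_seconds = time.perf_counter() - start

print(f"list: {list_seconds:.4f}s, array: {array_seconds:.4f}s")
```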

In this case, NumPy is an order of magnitude faster than lists. This is with arrays of sizes in the millions, but you may work on much larger arrays, with sizes in the order of billions. In that case, the difference is even larger.

 

Some reasons for such difference in speed are:

- NumPy is written in C, so array operations run as compiled code behind the scenes
- NumPy arrays are more compact than lists, i.e. they take much less storage space than lists

The following discussions explain the differences in speed between NumPy and standard Python:

- Why are NumPy arrays so fast? https://stackoverflow.com/questions/8385602/why-are-numpy-arrays-so-fast
- Why NumPy instead of Python lists? https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists

# Summary
To conclude, in this session, you learnt about NumPy, the most important package for scientific computing in Python. The topics that you learnt about are:

- Arrays, which are the basic data structure in the NumPy library.
- Creating NumPy arrays from a list or a tuple.
- Creating arrays of a known size using functions such as np.ones(), np.zeros(), np.random.random(), np.arange() and np.linspace().
- Analysing the shape and dimension of an array using array.shape, array.ndim and so on.
- Indexing, slicing and subsetting an array, which is very similar to indexing in lists.
- Working on multidimensional arrays.

Lastly, you studied the computation times in NumPy and standard Python lists and concluded that NumPy arrays are faster than standard lists.

 

In the next session, you will dive deep into various operations on NumPy arrays, and you will use these data structures to perform various tasks.

Additional Reading http://www.numpy.org/

# NumPy Arrays


Introduction
In this session we will go through the following:

- Manipulate arrays
- Reshape arrays
- Stack arrays
- Perform operations on arrays
- Perform basic mathematical operations
- Apply built-in functions
- Apply your own functions
- Apply basic linear algebra operations

Basic Operations
Manipulating Arrays
We can manipulate arrays, i.e. change their shape, combine them, split them, etc.

Reshaping Arrays
Reshaping is done using the reshape() function.
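A brief sketch of reshape() on a hypothetical array of 12 elements. The total number of elements must stay the same before and after reshaping:

```python
import numpy as np

arr = np.arange(12)              # 0, 1, ..., 11
as_3x4 = arr.reshape(3, 4)       # 3 rows, 4 columns
as_4x3 = arr.reshape(4, -1)      # -1 lets NumPy infer the remaining dimension

print(as_3x4.shape)
print(as_4x3.shape)
```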

Stacking and Splitting Arrays
Stacking: np.hstack() and np.vstack()

Stacking is done using the np.hstack() and np.vstack() functions. For horizontal stacking, the number of rows should be the same, while for vertical stacking, the number of columns should be the same.
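The shape requirements can be seen in this small sketch, where the arrays a, b and c are made up for illustration:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])        # shape (2, 2)
b = np.array([[5], [6]])              # shape (2, 1): same number of rows as a
c = np.array([[7, 8]])                # shape (1, 2): same number of columns as a

side_by_side = np.hstack((a, b))      # shape (2, 3)
on_top = np.vstack((a, c))            # shape (3, 2)

print(side_by_side)
print(on_top)
```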

 

Note: The Python notebook used in this section can be downloaded from the Introduction section.



### Operations on Arrays
Performing mathematical operations on arrays is extremely simple. Let's see some common operations.

 

Basic Mathematical Operations
NumPy provides almost all the basic math functions: exp, sin, cos, log, sqrt, etc. Each function is applied to every element of the array.
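For instance, a minimal sketch with a hypothetical three-element array:

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(arr)        # square root of each element
logs = np.log(arr)          # natural log of each element
scaled = arr * 2 + 1        # arithmetic operators are element-wise too

print(roots)
print(scaled)
```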

 

Applying User Defined Functions
You can also apply your own functions on arrays, for example, applying the function x/(x+1) to each element of an array.

One way to do that is by looping through the array, which is the non-NumPy way. You would rather write vectorized code.

The simplest way to do that is to vectorize the function you want and then apply it to the array.

NumPy provides the np.vectorize() function to vectorize functions. Note that np.vectorize() is provided mainly for convenience: internally it still loops over the elements, so it does not bring the speed of NumPy's built-in operations.

 

Let's look at both the ways to do it.
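Both ways can be sketched as follows for the function x/(x+1). The function name ratio and the sample array are hypothetical:

```python
import numpy as np

def ratio(x):
    return x / (x + 1)

arr = np.array([1.0, 2.0, 3.0])

# Way 1: loop through the array (the non-NumPy way)
looped = np.array([ratio(x) for x in arr])

# Way 2: wrap the function with np.vectorize() and call it on the whole array
ratio_vec = np.vectorize(ratio)
vectorized = ratio_vec(arr)

# For simple arithmetic like this, the expression also works on the array directly
direct = arr / (arr + 1)

print(vectorized)
```

The third form is usually the one to prefer when the function consists only of arithmetic and NumPy functions, since it runs entirely in NumPy's compiled code.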

 

### Basic Linear Algebra Operations
NumPy provides the np.linalg package to apply common linear algebra operations, such as:

- np.linalg.inv: Inverse of a matrix
- np.linalg.det: Determinant of a matrix
- np.linalg.eig: Eigenvalues and eigenvectors of a matrix

You can also multiply matrices using np.dot(a, b).
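A small sketch of these operations on a hypothetical 2 x 2 diagonal matrix, chosen so that the results are easy to check by hand:

```python
import numpy as np

a = np.array([[2.0, 0.0],
              [0.0, 4.0]])

inverse = np.linalg.inv(a)                      # inverse of a
determinant = np.linalg.det(a)                  # 2 * 4 = 8
eigenvalues, eigenvectors = np.linalg.eig(a)    # eig returns both as a pair
identity = np.dot(a, inverse)                   # a times its inverse

print(inverse)
print(determinant)
```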

### Summary
To conclude, in this session, you learnt about the various operations you can perform on a NumPy array. The operations that you learnt about are:

- Manipulating arrays, which can be done using reshape().
- Stacking and splitting arrays, which are similar to merging and appending; stacking can be done using hstack() and vstack().
- Applying user-defined functions to an array with vectorized code using np.vectorize().
- Performing linear algebra operations on an array, such as finding the inverse, determinant or eigenvalues of a matrix.

# Introduction to Pandas
Pandas is a library built using NumPy specifically for data analysis. You'll be using Pandas heavily for data manipulation, visualisation, building machine learning models, etc.

 

There are two main data structures in Pandas - Series and Dataframes. The default way to store data is dataframes, and thus manipulating dataframes quickly is probably the most important skill set for data analysis.

In this section, you will study:

- The pandas Series (similar to a NumPy array)
  - Creating a pandas series
  - Indexing series
- Dataframes
  - Creating dataframes from dictionaries
  - Importing CSV data files as pandas dataframes
  - Reading and summarising dataframes
  - Sorting dataframes

### Pandas Basics
The Pandas Series
A series is similar to a 1-D numpy array, and contains scalar values of the same type (numeric, character, datetime etc.). A dataframe is simply a table where each column is a pandas series.

Creating Pandas Series
Series are one-dimensional array-like structures, though unlike NumPy arrays, they often contain non-numeric data (characters, dates, time, booleans etc.)

 

You can create pandas series from array-like objects using pd.Series()
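A minimal sketch of pd.Series() with a list, a NumPy array and custom index labels (all values here are illustrative):

```python
import numpy as np
import pandas as pd

numbers = pd.Series([10, 20, 30])                      # default index 0, 1, 2
letters = pd.Series(np.array(["a", "b", "c"]))         # from a NumPy array
labelled = pd.Series([1.5, 2.5], index=["x", "y"])     # custom index labels

print(numbers[0])        # value at index label 0
print(labelled["y"])     # value at index label "y"
```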

Usually, you will work with Series only as a part of dataframes. Let's study the basics of dataframes.

The Pandas Dataframe
Dataframe is the most widely used data-structure in data analysis. It is a table with rows and columns, with rows having an index and columns having meaningful names.

Creating dataframes from dictionaries
There are various ways of creating dataframes, such as creating them from dictionaries, JSON objects, reading from txt, CSV files, etc.
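As a sketch of the dictionary route, with made-up column names and values:

```python
import pandas as pd

# Keys become column names; each list becomes one column
data = {
    "name": ["Asha", "Ben", "Chen"],
    "age": [31, 25, 42],
    "city": ["Pune", "Leeds", "Xi'an"],
}
df = pd.DataFrame(data)

print(df.shape)            # (rows, columns)
print(list(df.columns))
```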

 

Note: The Python notebook used in this section can be downloaded from the Introduction section.

An important concept in pandas dataframes is that of row indices. By default, each row is assigned an index starting from 0, and the indices are shown at the left side of the dataframe.

 

Let's now learn how we can change or manipulate the default indices and replace them with more sensible ones.
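One common way to do this is df.set_index(), sketched here on a hypothetical orders table where order_id makes a more meaningful index than 0, 1, 2:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["A1", "A2", "A3"],
    "sales": [250.0, 90.5, 410.0],
})

print(df.index.tolist())              # the default row indices

indexed = df.set_index("order_id")    # returns a new dataframe; df is unchanged
print(indexed.index.tolist())
print(indexed.loc["A2", "sales"])
```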



### Indexing and Selecting Data
In this section, we will learn to:

- Select rows from a dataframe
- Select columns from a dataframe
- Select subsets of dataframes

Selecting Rows
Selecting rows in dataframes is similar to the indexing you have seen in NumPy arrays. The syntax df[start_index:end_index] will subset rows according to the start and end indices.

Selecting Subsets of Dataframes
Until now, you have seen selecting rows and columns using the following ways:

- Selecting rows: df[start:stop]
- Selecting columns: df['column'] or df.column or df[['col_x', 'col_y']]
- df['column'] or df.column return a series
- df[['col_x', 'col_y']] returns a dataframe
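These forms can be tried on a small hypothetical dataframe (the column names follow the placeholders used above):

```python
import pandas as pd

df = pd.DataFrame({
    "col_x": [1, 2, 3, 4],
    "col_y": [10, 20, 30, 40],
    "col_z": ["a", "b", "c", "d"],
})

rows = df[1:3]                       # rows at positions 1 and 2
one_col = df["col_x"]                # a Series
also_one_col = df.col_x              # the same Series, attribute style
two_cols = df[["col_x", "col_y"]]    # a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
```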


But pandas does not recommend this way of indexing dataframes, since it is ambiguous. For instance, let's try to select the third row of the dataframe.

  
  -------
  
  
You have seen some ways of selecting rows and columns from dataframes. Let's now see some other ways of indexing dataframes, which pandas recommends, since they are more explicit (and less ambiguous).

There are two main ways of indexing dataframes:

- Position based indexing using df.iloc
- Label based indexing using df.loc

Using both the methods, we will do the following indexing operations on a dataframe:

- Selecting single elements/cells
- Selecting single and multiple rows
- Selecting single and multiple columns
- Selecting multiple rows and columns





To summarise, df.iloc[x, y] uses integer indices starting at 0.
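A short sketch of position based indexing on a hypothetical dataframe. The row labels A1 to A3 are deliberately non-numeric to show that df.iloc ignores them and uses positions only:

```python
import pandas as pd

df = pd.DataFrame(
    {"sales": [250.0, 90.5, 410.0], "profit": [40.0, -5.0, 120.0]},
    index=["A1", "A2", "A3"],
)

cell = df.iloc[0, 1]              # first row, second column
third_row = df.iloc[2]            # a Series holding the third row
first_two_rows = df.iloc[0:2]     # positions 0 and 1 (the end is excluded)
one_column = df.iloc[:, 0]        # all rows, first column
block = df.iloc[[0, 2], [0, 1]]   # rows 0 and 2, columns 0 and 1

print(cell)
print(third_row)
```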

The other common way of indexing is the label based indexing, which uses df.loc[].

Label Based Indexing
Pandas provides the df.loc[] functionality to index dataframes using labels.

As mentioned in the documentation, the inputs x, y to df.loc[x, y] can be:

- A single label, e.g. '3' or 'row_index'
- A list or array of labels, e.g. ['3', '7', '8']
- A range of labels, where row_x and row_y both are included, i.e. 'row_x':'row_y'
- A boolean array

Let's see some examples.
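Each kind of input can be sketched on a hypothetical dataframe indexed by the labels A1 to A3:

```python
import pandas as pd

df = pd.DataFrame(
    {"sales": [250.0, 90.5, 410.0], "profit": [40.0, -5.0, 120.0]},
    index=["A1", "A2", "A3"],
)

cell = df.loc["A2", "profit"]            # single labels
some_rows = df.loc[["A1", "A3"]]         # a list of labels
label_range = df.loc["A1":"A2"]          # a range of labels, both ends included
by_mask = df.loc[[True, False, True]]    # a boolean array, one value per row

print(cell)
print(label_range)
```

Unlike df.iloc and ordinary Python slicing, a label range in df.loc includes its end label.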

### Merge and Append
In this section, you will merge and concatenate multiple dataframes. Merging is one of the most common operations you will do, since data often comes in various files.

In our case, we have sales data of a retail store spread across multiple files. We will now work with all these data files and learn to:

- Merge multiple dataframes on common columns/keys using pd.merge()
- Concatenate dataframes using pd.concat()
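Both operations can be sketched on two small made-up tables. The table and column names (orders, customers, cust_id, segment) are hypothetical and do not come from the course data files:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust_id": ["c1", "c2", "c1"]})
customers = pd.DataFrame({"cust_id": ["c1", "c2"], "segment": ["Consumer", "Corporate"]})

# Merge on the common column; how="inner" keeps only keys present in both tables
merged = pd.merge(orders, customers, how="inner", on="cust_id")

# Concatenate dataframes with the same columns, one on top of the other
more_orders = pd.DataFrame({"order_id": [4], "cust_id": ["c2"]})
all_orders = pd.concat([orders, more_orders], axis=0, ignore_index=True)

print(merged)
print(all_orders)
```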

You have seen how to do indexing of dataframes using df.iloc and df.loc. Now, let's see how to subset dataframes based on certain conditions.

Subsetting Rows Based on Conditions
Often, you want to select rows which satisfy some given conditions. For example, select all the orders where Sales > 3000, or all the orders where 2000 < Sales < 3000 and Profit < 100.

Arguably, the best way to do these operations is using df.loc[], since df.iloc[] would require you to remember the integer column indices, which is tedious.
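The two conditions mentioned above can be sketched with df.loc[] as follows, using a made-up dataframe with Sales and Profit columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [3500, 2500, 2800, 1200],
    "Profit": [400, 50, 150, 20],
})

high_sales = df.loc[df["Sales"] > 3000]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
mid_sales_low_profit = df.loc[
    (df["Sales"] > 2000) & (df["Sales"] < 3000) & (df["Profit"] < 100)
]

# Select specific columns of the matching rows by name
profit_only = df.loc[df["Sales"] > 3000, ["Profit"]]

print(mid_sales_low_profit)
```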

Let's see some more in the next lecture.



### Grouping and Summarizing Dataframes
Grouping and aggregation are some of the most frequently used operations in data analysis, especially while doing exploratory data analysis (EDA), where comparing summary statistics across groups of data is common.

For example, in the retail sales data we are working with, you may want to compare the average sales of various regions, or compare the total profit of two customer segments.

 

Grouping analysis can be thought of as having three parts:

1. Splitting the data into groups (e.g. groups of customer segments, product categories, etc.)
2. Applying a function to each group (e.g. mean or total sales of each customer segment)
3. Combining the results into a data structure showing the summary statistics
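The three parts map onto pandas code roughly as in this sketch, which uses a made-up table of sales by customer segment:

```python
import pandas as pd

df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Sales": [100.0, 300.0, 200.0, 500.0],
})

by_segment = df.groupby("Segment")          # split the rows into groups
mean_sales = by_segment["Sales"].mean()     # apply a function and combine into a Series
total_sales = by_segment["Sales"].sum()

print(mean_sales)
print(total_sales)
```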


In the previous lecture, we learned how to create a groupby object. In the next lecture, let's see how to use that groupby object to carry out various aggregations.

# Lambda function & Pivot tables
Until now, we have not made any changes or modifications to the data. In this section, we will:

- Use lambda functions to create new and alter existing columns
- Use pandas pivot tables as an alternative to df.groupby() to summarise data

### Lambda Functions
Say you want to create a new column indicating whether a given order was profitable or not (1/0).

You need to apply a function which returns 1 if Profit > 0, else 0. This can be easily done using the apply() method on a column of the dataframe.
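A minimal sketch of this, assuming a dataframe with a Profit column (the new column name is_profitable is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Profit": [120.0, -30.0, 0.0, 45.5]})

# apply() calls the lambda once for every value in the Profit column
df["is_profitable"] = df["Profit"].apply(lambda x: 1 if x > 0 else 0)

print(df)
```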

In the next lecture, we will learn how we can create a new column using the existing columns in our dataframe. Columns which are created by the user are known as 'Derived Variables'. Derived variables increase the information conveyed by the dataframe.



### Pivot Tables
You may want to use pandas pivot tables as an alternative to groupby(). They provide Excel-like functionalities to create aggregate tables.
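As a sketch on made-up data, the pivot table below puts segments in rows, regions in columns and the mean of Sales in each cell:

```python
import pandas as pd

df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Region": ["East", "East", "West", "West"],
    "Sales": [100.0, 300.0, 200.0, 500.0],
})

table = df.pivot_table(values="Sales", index="Segment", columns="Region", aggfunc="mean")

print(table)
```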



### Summary
To conclude, in this session, you learnt about the Pandas library, which provides various functions to conduct data analysis in Python. The topics that you learnt about are:

- Pandas series and dataframes, which are the basic data structures in the Pandas library.
- Indexing, selecting and subsetting a dataframe.
- Merging and appending dataframes, which can be done using pd.merge() and pd.concat().
- Grouping and summarising dataframes, which can be done by creating a groupby object with groupby() and then applying aggregation functions to it.
- Using pivot tables on a dataframe, which are similar to the pivot tables provided in MS Excel.

In the next session, you will learn how to get data from various sources and clean it for analysis.

# Getting and Cleaning data
### Introduction
There are multiple ways of getting data into python, depending on where the data is stored. The simplest case is when you have data in CSV files, but often, you need to get data from other formats, sources and documents, such as text files, relational databases, websites, APIs, PDF documents, etc.

 

In the following sections, you will learn to get data into python from a number of sources. You will learn to:

- Get data from text files
- Get data from relational databases
- Scrape data from websites
- Get data from publicly available APIs
- Read PDFs into python

In the process, you will also learn how to deal with nuances that inevitably come while getting data from various sources

# Reading Delimited Files and Relational Databases
Getting Data From Delimited Files
Delimited files are usually text files, where columns are separated by delimiters (such as commas, tabs, semicolons etc.) and each new line is a row.
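A minimal sketch of reading a semicolon-delimited file with pd.read_csv(). To keep the example self-contained, io.StringIO stands in for a file on disk, and the contents are made up:

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file, pass its path instead
raw = io.StringIO("name;age\nAsha;31\nBen;25\n")

df = pd.read_csv(raw, sep=";")    # sep tells pandas which delimiter to use

print(df)
```

For tab-separated files, the same call works with sep="\t".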

 

# Getting Data From Relational Databases
Data is commonly stored in RDBMS, and it is easy to get it into Python. We'll use the most common one - MySQL.

There are many libraries to connect MySQL and Python, such as pymysql, MySQLdb, etc. All of them follow the following procedure to connect to MySQL:

- Create a connection object between MySQL and python
- Create a cursor object (you use the cursor to execute queries and fetch their results)
- Execute the SQL query
- Retrieve results of the query using methods such as fetchone(),  fetchall(), etc.
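The four steps can be sketched as below. Note the swap: this uses Python's built-in sqlite3 module with an in-memory database instead of MySQL, so that it runs without a database server. The cursor steps follow the same pattern with pymysql, though the connection call there takes server details such as host, user and password. The orders table is hypothetical.

```python
import sqlite3

# 1. Create a connection object (an in-memory database here)
conn = sqlite3.connect(":memory:")

# 2. Create a cursor object, used to run queries and fetch their results
cursor = conn.cursor()

# Set up a small hypothetical table so the query has something to return
cursor.execute("CREATE TABLE orders (order_id INTEGER, sales REAL)")
cursor.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 250.0), (2, 90.5)])
conn.commit()

# 3. Execute the SQL query
cursor.execute("SELECT order_id, sales FROM orders ORDER BY order_id")

# 4. Retrieve the results
first_row = cursor.fetchone()     # the next row, as a tuple
remaining = cursor.fetchall()     # all rows not yet fetched, as a list of tuples

conn.close()
print(first_row, remaining)
```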


Additional reading:

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
- http://jrogel.com/python-3-pandas-encoding-issues/

How will you find out the encoding scheme of your dataset if it is unknown?

- Ask the developer
- Use the chardet library (correct)
- Use read_csv to identify the encoding scheme automatically

Feedback: Yes, you can use the chardet library to identify the encoding scheme used in a given CSV file.

# Reading Data From Websites
Web scraping refers to the art of programmatically getting data from the internet. One of the coolest features of python is that it makes it easy to scrape websites.

 

In Python 3, the most popular library for web scraping is BeautifulSoup. To use BeautifulSoup, we will also need the requests module, which connects to a given URL and fetches data from it (in HTML format). A web page is essentially HTML code, and the main use of BeautifulSoup is that it helps you parse HTML easily.
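A minimal sketch of the parsing step, assuming the bs4 package is installed. To avoid a network call, a hand-written HTML string stands in for the text that requests would fetch from a URL; the tag and class names are made up and do not reflect any real page.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that requests.get(url).text would return
html = """
<html><body>
  <h1>App reviews</h1>
  <div class="review">Great app</div>
  <div class="review">Crashes often</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")    # parse with Python's built-in HTML parser
title = soup.find("h1").get_text()
reviews = [div.get_text() for div in soup.find_all("div", class_="review")]

print(title)
print(reviews)
```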

Note: Discussion on HTML syntax is beyond the scope of this module, though even very basic HTML experience should be enough to understand web scraping.

 Use Case - Fetching Mobile App Reviews from Google Playstore
Let's say you want to understand why people install and uninstall mobile apps, and why they like or dislike certain apps. A very rich source of app-reviews data is the Google Playstore, where people write their feedback about the app.

 

The reviews of the Facebook Messenger app can be found here: https://play.google.com/store/apps/details?id=com.facebook.orca&hl=en

 

We will scrape reviews of the Messenger app, i.e. get them into python, and then you can do some interesting analyses on that.



What are the two libraries you would need to scrape website data in Python?

Suggested answer:

1. requests, to connect to a URL and get data from it.
2. BeautifulSoup, to parse the fetched HTML into an object.

### Getting Data From APIs
APIs, or application programming interfaces, are created by companies and organisations to provide restricted access to data. It is very common to get data from APIs for data analysis. For example, you can get financial data (stock prices etc.), social media data (Facebook, Twitter etc. provide APIs), weather data, and data about healthcare, music, food and drinks, and almost every other domain.

 

Apart from being rich sources of data, there are other reasons to use APIs:

- When the data is being updated in real time: If you use downloaded CSV files, you'll have to download data manually and update your analysis multiple times. Through APIs, you can automate the process of getting real-time data.
- Easy access to structured and verified data: Though you can scrape websites, APIs can directly provide data in a structured format, and the data is usually of better quality.
- Access to restricted data: You cannot scrape all websites easily, and that's often illegal (e.g. Facebook, financial data etc.). APIs are the only way to get this data.

There are many more reasons depending on the use cases and the domain of application.

A list of useful APIs is available here: https://github.com/toddmotto/public-apis



Example Use Case: Google Maps Geocoding API
Google Maps provides many APIs, one of which is the Google Maps Geocoding API. You can use it to geocode addresses, i.e. get the latitude-longitude coordinates, and vice-versa.



APIs
How will you read data from an API? Briefly mention the steps involved.

Suggested answer:

1. Join the words in the address with a plus sign, converting it to the form words+in+the+address.
2. Connect to the URL by appending the address and the API key.
3. Get a response from the API and convert it to a Python object (here, a dictionary).
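The steps in the suggested answer can be sketched without making a real network call. The address and API key below are placeholders, and sample_response is a hand-written stand-in whose layout is an assumption about the Geocoding API's JSON output, not real API data.

```python
import json
from urllib.parse import urlencode

address = "Bangalore Palace, Bengaluru"    # hypothetical address
api_key = "YOUR_API_KEY"                   # placeholder, not a real key

# Steps 1 and 2: encode the address (spaces become +) and append it with the key
query = urlencode({"address": address, "key": api_key})
url = "https://maps.googleapis.com/maps/api/geocode/json?" + query

# Step 3: convert the JSON response text to a Python dictionary
sample_response = (
    '{"status": "OK", "results": '
    '[{"geometry": {"location": {"lat": 12.0, "lng": 77.0}}}]}'
)
data = json.loads(sample_response)
location = data["results"][0]["geometry"]["location"]

print(url)
print(location)
```

With a real key, the response text would come from fetching the URL, for example with the requests module, instead of from the hand-written string.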

#  Reading Data From PDF Files
Reading PDF files is not as straightforward as reading text or delimited files, since PDFs often contain images, tables, etc. PDFs are mainly designed to be human-readable, and thus you need special libraries to read them in python (or any other programming language).

 

Luckily, there are some really good libraries in Python. We will use PyPDF2 to read PDFs in Python since it is easy to use and works with most types of PDFs.

 

Note that Python will only be able to read text from PDFs, not images, tables etc. (though that is possible using other specialised libraries).

 

You can install PyPDF2 using pip install PyPDF2.

 

For this illustration, we will read a PDF of the book 'Animal Farm' written by George Orwell.



Reading PDF Files
Read the document and answer the question.

After creating a PDF file object, a PDF reader object and a page object, which command will you use to extract text from a PDF page?

- readText()
- extractText() (correct)
- getText()

Feedback: Yes, you can read a PDF page using the extractText() command. Newer PyPDF2 releases name this method extract_text().


# Cleaning Datasets
In this section, we will study ways to identify and treat missing data. We will:

- Identify missing data in dataframes
- Treat (delete or impute) missing values

There are various reasons for missing data, such as human errors during data entry, non-availability at the user's end (e.g. DOB of certain people), etc. Most often, the reasons are simply unknown.

 

In Python, missing data is represented using either of the two objects NaN (Not a Number) or None. We'll not get into the differences between them and how Python stores them internally etc. We'll focus on studying ways to identify and treat missing values in Pandas dataframes.

 

There are four main methods to identify and treat missing data:

- isnull(): Indicates presence of missing values, returns a boolean
- notnull(): Opposite of isnull(), returns a boolean
- dropna(): Drops the missing values from a data frame and returns the rest
- fillna(): Fills (or imputes) the missing values by a specified value

For this exercise, we will use the Melbourne house pricing dataset

### Identifying Missing Values
The methods isnull() and notnull() are the most common ways of identifying missing values.

- While handling missing data, you first need to identify the rows and columns containing missing values, count the number of missing values, and then decide how you want to treat them.

 

- It is important that you treat missing values in each column separately, rather than implementing a single solution (e.g. replacing NaNs by the mean of a column) for all columns.

 

- isnull() returns a boolean (True/False) which can then be used to find the rows or columns containing missing values.
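A small sketch of identifying and counting missing values. The dataframe is made up and is not the Melbourne dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [500.0, np.nan, 750.0],
    "Rooms": [2, 3, np.nan],
    "Suburb": ["A", "B", "C"],
})

mask = df.isnull()                               # dataframe of True/False values
missing_per_column = df.isnull().sum()           # count of missing values in each column
missing_per_row = df.isnull().sum(axis=1)        # count of missing values in each row
rows_with_missing = df[df.isnull().any(axis=1)]  # rows having at least one missing value

print(missing_per_column)
print(rows_with_missing)
```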

### Treating Missing Values
There are broadly two ways to treat missing values:

- Delete: Delete the missing values
- Impute:
  - Imputing by a simple statistic: Replace the missing values by another value, commonly the mean, median, mode etc.
  - Predictive techniques: Use statistical models such as k-NN, SVM etc. to predict and impute missing values

In general, imputation makes assumptions about the missing values and replaces them with substitute values such as the mean, median etc. It should be used only when you are reasonably confident about the assumptions.

Otherwise, deletion is often safer and recommended. You may lose some data, but will not make any unreasonable assumptions.

 

Caution: Always have a backup of the original data if you're deleting missing values.
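Deletion with dropna() can be sketched as follows on a made-up dataframe. The copy() call keeps the original data as a backup, in line with the caution above:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({
    "Price": [500.0, np.nan, 750.0],
    "Rooms": [2.0, 3.0, np.nan],
})

df = original.copy()                          # work on a copy; keep the original as a backup
no_missing_rows = df.dropna()                 # drop rows with any missing value
price_known = df.dropna(subset=["Price"])     # drop rows only where Price is missing
no_missing_cols = df.dropna(axis=1)           # drop columns with any missing value

print(no_missing_rows)
print(price_known)
```

dropna() returns a new dataframe by default, so df itself is unchanged unless you assign the result back.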



In the previous lecture, we learned to handle missing values by deleting them. Now let's learn how to impute them with the mean value of the column.
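A minimal sketch of mean imputation with fillna(), using a made-up Price column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [500.0, np.nan, 700.0]})

mean_price = df["Price"].mean()                  # mean() skips NaN values by default
df["Price"] = df["Price"].fillna(mean_price)     # replace NaN with the column mean

print(df)
```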

Additional Reading
How you treat missing values should ideally depend upon an understanding of why missing values occur. The reasons are classified into categories such as missing completely at random, missing at random, missingness that depends on the missing value itself etc.

 

We'll not discuss why missing values occur, though you can read this article if interested.

http://www.stat.columbia.edu/~gelman/arm/missing.pdf