# Python for Data Science

The contents of this workshop have been divided as follows:
1. IPython / Jupyter: Beyond Normal Python
2. Introduction to NumPy
3. Data Manipulation with Pandas
4. Visualization with Matplotlib



## 1. IPython / Jupyter: Beyond Normal Python

IPython (short for Interactive Python) is an enhanced Python interpreter. If Python is the engine of our data science task, you might think of IPython as the interactive control panel. IPython is about using Python effectively for interactive scientific and data-intensive computing. In addition, IPython is closely tied with the Jupyter project, which provides a browser-based notebook that is useful for development, collaboration, sharing, and even publication of data science results. The IPython notebook is actually a special case of the broader Jupyter notebook structure, which encompasses notebooks for Julia, R, and other programming languages.


In [1]:
data = [1, 2.5, '7']
data

[1, 2.5, '7']

In [2]:
data?

Type:        list
String form: [1, 2.5, '7']
Length:      3
Docstring:
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.


In [3]:
def square(a):
    """Return the square of a."""
    return a ** 2

In [4]:
square(7)

49

In [5]:
square??

Signature: square(a)
Source:
def square(a):
    """Return the square of a."""
    return a ** 2
File:      c:\users\sharjeel\documents\github\pythonfords\<ipython-input-3-c96e82bfafc5>
Type:      function


### Magic Commands vs Shell Commands

**Magic commands** are enhancements that IPython adds on top of normal Python syntax; they are prefixed by the **%** character. They are designed to solve common data analysis problems succinctly. Magic commands come in two flavors: line magics, denoted by a single `%` prefix, operate on a single line of input, while cell magics, denoted by a double `%%` prefix, operate on multiple lines of input.

Some of the magic functions mirror familiar shell commands: ``%pwd``, ``%cd``, ``%cat``, ``%cp``, ``%env``, ``%ls``, ``%man``, ``%mkdir``, ``%more``, ``%mv``, ``%rm``, and ``%rmdir``. Several of these are aliases for system commands, so their availability can depend on the operating system.

In [6]:
%pwd

'C:\\Users\\Sharjeel\\Documents\\GitHub\\PythonforDS'

When working interactively with the standard Python interpreter, one of the frustrations is the need to switch between multiple windows to access Python tools and system command-line tools. IPython bridges this gap, and gives you a syntax for executing shell commands directly from within the IPython terminal. The magic happens with the exclamation point: anything appearing after " **!** " on a line will be executed not by the Python kernel, but by the system command-line.

In [7]:
!ver


Microsoft Windows [Version 10.0.18363.720]
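
Outside IPython, a similar effect can be approximated with the standard `subprocess` module. The following is only a minimal sketch: it launches the current Python interpreter as the "system command" so that it does not depend on a particular shell, and the printed message is a made-up placeholder.

```python
import subprocess
import sys

# Run an external command from plain Python, roughly what `!command` does in IPython.
# sys.executable (the running interpreter) is used so the example works on any OS.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout/stderr instead of printing them directly
    text=True,            # decode the output bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

output = result.stdout.strip()
print(output)
```

In a notebook the `!` syntax is far more convenient; `subprocess` is mainly useful when the same step has to run in a regular script.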


## 2. Introduction to NumPy
Python's built-in data can be broadly classified into the following two groups: basic data types and containers.

### Basic data types
Like most languages, Python has a number of basic types, including integers, floats, booleans, and strings.

**Numbers:** Integers and floats work as you would expect from other languages.

**Booleans:** Python implements all of the usual operators for Boolean logic, but spells them as English words (`and`, `or`, `not`) rather than symbols.

**Strings:** Python has great support for strings.

In [None]:
x = 3
y = 2.5

t = True
f = False

hw = 'Hello World'

print(type(x))
print(type(y))
print(type(t))
print(type(f))
print(type(hw))
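
The cell above only prints the types. As a small illustrative sketch, here are a few operations on the same kinds of values; the variable names beyond `x`, `y`, `t`, `f`, and `hw` are chosen purely for illustration.

```python
x = 3
y = 2.5

# Numbers: mixing an int and a float gives a float
total = x + y        # addition
power = x ** 2       # exponentiation
quotient = x / 2     # true division always returns a float
floor_q = x // 2     # floor division drops the fractional part

# Booleans: logic operators are written as words
t, f = True, False
both = t and f
either = t or f
negated = not t

# Strings: concatenation, methods, and f-string formatting
hw = 'Hello' + ' ' + 'World'
shout = hw.upper()
message = f'{hw} has {len(hw)} characters'

print(total, power, quotient, floor_q)
print(both, either, negated)
print(shout)
print(message)
```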

### Containers
Python includes several built-in container types: lists, dictionaries, sets, and tuples.

**Lists:** A list is the Python equivalent of an array, but it is resizable and can contain elements of different types.

**Dictionaries:** A dictionary stores (key, value) pairs.

**Sets:** A set is an unordered collection of distinct elements.

**Tuples:** A tuple is an (immutable) ordered list of values.

In [None]:
xs = [3, 1, 2]    # Create a list

d = {'cat': 'cute', 'dog': 'furry'}  # Create a new dictionary with some data

animals = {'cat', 'dog'} # Create a set

t = (5, 6)        # Create a tuple

print(type(xs))
print(type(d))
print(type(animals))
print(type(t))
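
The properties listed above (resizable lists, key lookup in dictionaries, distinct elements in sets, immutable tuples) can be checked directly. This is a short sketch; the extra keys and values such as `'fish'` and `'monkey'` are made up for the example.

```python
xs = [3, 1, 2]
xs.append('four')            # lists can grow and can mix element types
last = xs[-1]                # negative indices count from the end

d = {'cat': 'cute', 'dog': 'furry'}
d['fish'] = 'wet'            # add a new (key, value) pair
cat_desc = d['cat']          # look a value up by its key
missing = d.get('monkey', 'N/A')  # .get returns a default instead of raising KeyError

animals = {'cat', 'dog'}
animals.add('cat')           # adding an existing element changes nothing
n_animals = len(animals)

t = (5, 6)
try:
    t[0] = 1                 # tuples do not support item assignment
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

print(xs, last)
print(cat_desc, missing)
print(n_animals, tuple_is_immutable)
```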

### Why NumPy?
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. A NumPy array is a thin wrapper around C arrays, so, unlike a list, all of its elements share a single data type. Use a NumPy array when you want to perform mathematical operations: arithmetic operators act element-wise on an array, whereas `+` on a list concatenates instead of adding.

In [None]:
import numpy as np

In [None]:
d1 = np.array([1,2])           # 1D array
d2 = np.array([[1,2],[10,20]]) # 2D array

print(type(d1))
print(d2)
d2.ndim

In [None]:
lsData = [1, 2, 3]
print(lsData + [2, 5, 9])

npData = np.array(lsData)
print(npData + [2, 5, 9])

In [None]:
A = np.array( [[1,1],
               [0,1]] )

B = np.array( [[2,0], [3,4]] )

print(A)
print(B)

print(A * B) # Element-wise product
print(A @ B) # Matrix product

print(np.dot(A, B)) # Matrix product (same result as A @ B for 2D arrays)
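
To make the difference between the two products concrete, the sketch below spells out the matrix product with explicit loops and compares it with `@`. The helper `matmul_loops` is a hypothetical name written only for this illustration; it is not part of NumPy.

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])

def matmul_loops(a, b):
    """Matrix product via explicit loops: out[i, j] = sum over k of a[i, k] * b[k, j]."""
    n, m = a.shape
    m2, p = b.shape
    if m != m2:
        raise ValueError("inner dimensions must match")
    out = np.zeros((n, p), dtype=a.dtype)
    for i in range(n):
        for j in range(p):
            for k in range(m):
                out[i, j] += a[i, k] * b[k, j]
    return out

manual = matmul_loops(A, B)
elementwise = A * B   # multiplies entries at the same position only

print(manual)
print(elementwise)
```

The loop version is only meant to show what `A @ B` computes; in practice the built-in operator is both shorter and much faster.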

In [None]:
print(B[1, :]) # 2nd Row

print(B[:, 0]) # 1st Column

print(A[1,0])  # Element at Row 2, Col 1


### Speed comparison between Python & NumPy

Python and NumPy share some similarly named functions, e.g. `sum`, `min`, and `max`. However, because NumPy executes the operation in compiled code, its version runs much more quickly on arrays.


In [None]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
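
`%timeit` is only available inside IPython or Jupyter. A rough equivalent for a plain Python script can be sketched with the standard `timeit` module; the array size and the repeat counts below are arbitrary choices for the example.

```python
import timeit

import numpy as np

big_array = np.random.rand(100_000)

# Run each version several times and keep the fastest measurement,
# which is less sensitive to background noise than an average.
builtin_time = min(timeit.repeat(lambda: sum(big_array), number=5, repeat=3))
numpy_time = min(timeit.repeat(lambda: np.sum(big_array), number=5, repeat=3))

print(f"built-in sum: {builtin_time:.4f} s for 5 calls")
print(f"np.sum:       {numpy_time:.4f} s for 5 calls")
```

The absolute numbers depend on the machine, so only the ratio between the two timings is meaningful.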

## 3. Data Manipulation with Pandas

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to "data munging" tasks that occupy much of a data scientist's time.

In [None]:
import pandas as pd

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
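
With an explicit index, values can be reached either by label or by position. The sketch below rebuilds the same Series and shows both access styles; the variable names on the left-hand side are illustrative only.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

by_label = data['b']             # explicit index: look up by label
by_position = data.iloc[1]       # implicit index: look up by integer position

label_slice = data.loc['a':'c']  # label slices include the end label
pos_slice = data.iloc[0:2]       # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Note the asymmetry: slicing by label keeps the endpoint, while slicing by position follows the usual Python convention and leaves it out.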

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states
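
Because both Series carry state names as their index, Pandas lines the values up by label rather than by position. The following sketch builds the same data with the `area` entries deliberately listed in a different order, then derives a `density` column (a name chosen for this example).

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Same states, listed in a different order on purpose
area = pd.Series({'Illinois': 149995, 'Florida': 170312, 'New York': 141297,
                  'Texas': 695662, 'California': 423967})

states = pd.DataFrame({'population': population, 'area': area})

# Column arithmetic is element-wise and aligned on the row labels
states['density'] = states['population'] / states['area']
print(states)
```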

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

### Reading a CSV dataset
#### Reading using Python's built-in csv module

In [None]:
import csv
from pprint import pprint

In [None]:
csv_file = open("IMDB-Movie-Data.csv")

reader = csv.reader(csv_file)

In [None]:
headers = next(reader)
pprint(headers)

In [None]:
line = next(reader)
pprint(line)

In [None]:
csv_file.close()

In [None]:
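# Complete process for reading a CSV file

data_csv = []

try:
    # newline='' is what the csv module documentation recommends for file objects
    with open("IMDB-Movie-Data.csv", 'r', newline='') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        for line in reader:
            data_csv.append(line)
except (OSError, csv.Error) as err:
    # Catch only file and parsing errors rather than using a bare except
    print(f"The process encountered an error: {err}")

pprint(data_csv[5])  # Row at index 5 (index 0 holds the header)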
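
The `csv` module can also map each row to the header names with `csv.DictReader`. The sketch below reads from an in-memory string so that it runs without the dataset; the two rows ("Movie A", "Movie B") are made-up placeholder values, not entries from the IMDB file.

```python
import csv
import io

# Placeholder CSV content standing in for a real file
sample = io.StringIO(
    "Title,Year,Rating\n"
    "Movie A,2014,8.1\n"
    "Movie B,2016,6.2\n"
)

reader = csv.DictReader(sample)   # the first line is used as the field names
rows = list(reader)

first_title = rows[0]['Title']
# The csv module returns every value as a string, so convert explicitly
ratings = [float(row['Rating']) for row in rows]

print(reader.fieldnames)
print(first_title, ratings)
```

The need for manual type conversion is one of the main reasons to prefer `pd.read_csv`, which infers column types automatically.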

#### Reading using Pandas

In [None]:
movies  = pd.read_csv("IMDB-Movie-Data.csv", delimiter=',')
moviesT = pd.read_csv("IMDB-Movie-Data.csv", index_col='Title')

In [None]:
print(type(movies))

movies.head()

In [None]:
moviesT.tail()

In [None]:
movies.info()

In [None]:
movies.describe()

In [None]:
movies.describe(include='all')

In [None]:
movies.isnull()

In [None]:
movies_drop = movies.dropna()
movies_drop.info()

In [None]:
movies_dropC = movies.dropna(axis='columns')
movies_dropC.info()
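
The effect of the two `dropna` variants is easier to see on a tiny table. The DataFrame below is a made-up stand-in with placeholder titles and values; only the column names follow the movie dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Rating': [8.1, np.nan, 6.2],
    'Revenue (Millions)': [100.0, 50.0, np.nan],
})

# isnull() gives a True/False table; summing it counts missing values per column
missing_per_column = df.isnull().sum()

rows_kept = df.dropna()                # drop every row that has a missing value
cols_kept = df.dropna(axis='columns')  # drop every column that has a missing value

print(missing_per_column)
print(rows_kept)
print(cols_kept)
```

Dropping rows keeps all columns but loses observations, while dropping columns keeps all rows but loses variables, so the right choice depends on where the gaps are concentrated.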

#### Extracting Data

In [None]:
subsetDF = movies[['Genre', 'Rating']] # Extraction by Column 

subsetDF.head()

In [None]:
subsetDF_row = moviesT.loc["Prometheus"] # Extraction by Row Name
subsetDF_row

In [None]:
subsetDF_rowT = moviesT.iloc[1] # Extraction by Row Index
subsetDF_rowT
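
The difference between `.loc` (labels) and `.iloc` (integer positions) can be sketched on a small, made-up table indexed by title. The titles and numbers are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {'Year': [2012, 2014, 2016], 'Rating': [7.0, 8.1, 6.2]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='Title'),
)

by_name = df.loc['Movie B']          # select the row by its index label
by_pos = df.iloc[1]                  # select the row by its integer position
cell = df.loc['Movie B', 'Rating']   # row label and column label together

print(by_name)
print(cell)
```

Here both selections return the same row, because 'Movie B' happens to sit at position 1.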

#### Extraction Example
Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, and earned less than the 25th percentile of revenue.

In [None]:
moviesT[
    ((moviesT['Year'] >= 2005) & (moviesT['Year'] <= 2010))
    & (moviesT['Rating'] > 8.0)
    & (moviesT['Revenue (Millions)'] < moviesT['Revenue (Millions)'].quantile(0.25))
]
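
A compound filter like this is easier to read when each condition is given a name first. The sketch below applies the same logic to a small made-up table (placeholder titles and values), so the result can be checked by eye.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Year': [2004, 2006, 2008, 2009, 2012],
        'Rating': [8.5, 8.2, 7.0, 8.4, 8.8],
        'Revenue (Millions)': [10.0, 5.0, 300.0, 150.0, 20.0],
    },
    index=['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
)

# Each condition is a boolean Series with one True/False per row
in_range = df['Year'].between(2005, 2010)   # both endpoints are included
high_rating = df['Rating'] > 8.0
threshold = df['Revenue (Millions)'].quantile(0.25)
low_revenue = df['Revenue (Millions)'] < threshold

# & combines the masks element-wise; Python's `and` would not work here
result = df[in_range & high_rating & low_revenue]
print(threshold)
print(result)
```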

### Pandas == SQL for Python

**SQL:** SELECT TOP 5 * FROM movies

In [None]:
movies.head(5)

**SQL:** SELECT * FROM movies

In [None]:
movies

**SQL:** SELECT Title FROM movies

In [None]:
movies[['Title']]

**SQL:** SELECT Title, Genre FROM movies

In [None]:
movies[['Title', 'Genre']]

**SQL:** SELECT * FROM movies WHERE Year = 2014

In [None]:
movies[movies['Year'] == 2014]

**SQL:** SELECT * FROM movies WHERE Year = 2014 AND Rating > 8

In [None]:
movies[(movies['Year'] == 2014) & (movies['Rating'] > 8)]

**SQL:** SELECT * FROM movies WHERE Year = 2014 OR Rating > 8

In [None]:
movies[(movies['Year'] == 2014) | (movies['Rating'] > 8)]

**SQL:** SELECT * FROM movies WHERE Year IS NULL

In [None]:
movies[movies['Year'].isnull()]

**SQL:** SELECT * FROM movies WHERE Year IS NOT NULL

In [None]:
movies[movies['Year'].notnull()]
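
One way to convince yourself that a Pandas expression matches its SQL counterpart is to run both on the same data. The sketch below does this with an in-memory SQLite database from the standard library; the table contents are made-up placeholders and `movies_small` is a hypothetical name for the example.

```python
import sqlite3

import pandas as pd

movies_small = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Year': [2014, 2014, 2016],
    'Rating': [8.3, 7.1, 8.6],
})

conn = sqlite3.connect(':memory:')                 # temporary database in RAM
movies_small.to_sql('movies', conn, index=False)   # copy the DataFrame into a table

# The SQL version of the query
from_sql = pd.read_sql_query(
    "SELECT * FROM movies WHERE Year = 2014 AND Rating > 8", conn
)
conn.close()

# The Pandas version of the same query
from_pandas = movies_small[
    (movies_small['Year'] == 2014) & (movies_small['Rating'] > 8)
].reset_index(drop=True)

print(from_sql)
print(from_pandas)
```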

## 4. Visualization with Matplotlib

Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. Matplotlib supports dozens of backends and output types, which means you can count on it to work regardless of which operating system you are using or which output format you wish. This cross-platform, everything-to-everyone approach has been one of the great strengths of Matplotlib. It has led to a large user base, which in turn has led to an active developer base and Matplotlib’s powerful tools and ubiquity within the scientific Python world.

In [None]:
import matplotlib.pyplot as plt

plt.style.use('classic')

%matplotlib inline

### Matplotlib Line Plot
Perhaps the simplest of all plots is the visualization of a single function ``y = f(x)``. Here we will take a first look at creating a simple plot of this type.


In [None]:
x = np.linspace(-2 * np.pi, 2 * np.pi, 100)
plt.plot(x, np.sin(x))


In [None]:
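# On Matplotlib versions before 3.6 this style is named 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')

plt.plot(x, np.cos(x))
plt.plot(x, np.sin(x), '-y')
plt.plot(x, np.tan(x), '-.r')

# Legend entries must follow the order in which the curves were plotted
plt.legend(['Cosine Wave', 'Sine Wave', 'Tangent Wave'])

plt.xlabel('Time')
plt.ylabel('Amplitude')

plt.ylim([-2, 2])

plt.title("Simple Line Plot")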

In [None]:
plt.subplot(3,1,1); plt.plot(x, np.cos(x))

plt.subplot(3,1,2); plt.plot(x, np.sin(x), '-y')

plt.subplot(3,1,3); plt.plot(x, np.tan(x), '-.r')

# The calls below act on the current axes, i.e. only the third subplot
plt.xlabel('Time')
plt.ylabel('Amplitude')

plt.ylim([-2, 2])

plt.title("Simple Line Plot")
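
With `plt.subplot`, labels and limits only affect whichever subplot is current. An alternative is Matplotlib's object-oriented interface, where every setting is applied to an explicit axes object. The following is a sketch of that approach, not a replacement for the cell above; the `Agg` backend line is only there so the code runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 100)

# One figure with three stacked axes that share the same x-axis
fig, axes = plt.subplots(3, 1, sharex=True)

curves = [('Cosine', np.cos(x)), ('Sine', np.sin(x)), ('Tangent', np.tan(x))]
for ax, (name, y) in zip(axes, curves):
    ax.plot(x, y)
    ax.set_title(name)          # each panel gets its own title
    ax.set_ylabel('Amplitude')
    ax.set_ylim(-2, 2)          # same vertical range in every panel

axes[-1].set_xlabel('Time')     # label the shared x-axis once, at the bottom
fig.tight_layout()
```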

In [None]:
plt.plot(movies['Runtime (Minutes)'])

In [None]:
plt.plot(movies['Runtime (Minutes)'], movies['Revenue (Millions)'])

In [None]:
plt.plot( movies['Rating'], movies['Revenue (Millions)'])

### Matplotlib Scatter Plot
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape.

In [None]:
plt.scatter(movies['Runtime (Minutes)'], movies['Rating'])

In [None]:
plt.scatter( movies['Rating'], movies['Revenue (Millions)'] , color='black')
plt.xlabel('Movie Ratings')
plt.ylabel('Revenue (in Millions)')

In [None]:
plt.plot( movies['Rating'], movies['Revenue (Millions)'], 'o', color='black')
plt.xlabel('Movie Ratings')
plt.ylabel('Revenue (in Millions)')

In [None]:
movies['Rating'].plot(kind="box") # Utilizes matplotlib for plotting

In recent years, the interface and style of Matplotlib have begun to show their age. Newer tools like ggplot and ggvis in the R language, along with web visualization toolkits based on D3.js and HTML5 canvas, often make Matplotlib feel clunky and old-fashioned. So people have been developing new packages that build on its powerful internals to drive Matplotlib via cleaner, more modern APIs. **``Seaborn``** is one such package.

In [None]:
import seaborn as sns

In [None]:
sns.catplot(x='Year', data=movies, kind='count')  # pass the column by keyword; recent seaborn versions reject it positionally
plt.xticks(rotation=45)

In [None]:
sns.pairplot(data=movies_drop);


## References

https://jakevdp.github.io/PythonDataScienceHandbook/

https://docs.scipy.org/doc/numpy/user/quickstart.html

http://cs231n.github.io/python-numpy-tutorial/

https://medium.com/fintechexplained/why-should-we-use-numpy-c14a4fb03ee9

https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

https://www.programiz.com/python-programming/csv

https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

https://byungjun0689.github.io/3.-IMDB-5000-movie-datasets/

https://github.com/maazh/IMDB-Movie-Dataset-Analysis/blob/master/Investigating_IMDB_Movie_dataset.ipynb

https://yoursdata.net/sql-query-python-select/

https://medium.com/datadriveninvestor/data-science-analysis-of-movies-released-in-the-cinema-between-2000-and-2017-b2d9e515d032

### Datasets Search

https://www.kaggle.com/datasets

https://vincentarelbundock.github.io/Rdatasets/datasets.html