# Introduction to Python for Data Science: The Basics
 
Created By: The TTLAB Team

<div>
<img src="Data/Images/TTLAB.png" width="100"/>
</div>

___

“Success is stumbling from failure to failure with no loss of enthusiasm.”
― Winston S. Churchill

<div>
<img src="Data/Images/codingcurve.jpg" width="500"/>
</div>

[Source](https://lifehacker.com/why-learning-to-code-is-so-hard-and-what-you-can-do-ab-1685229278)


___
## What is Python? 
* Not a snake (in this case)
* Very popular programming language that is relatively easy to read
* Comes with a generous standard library that supports commonplace programming tasks
* Runs on all major platforms: macOS, Windows, Linux, Unix
* It's free!
* Created by Guido van Rossum, who began work on it in 1989 (first released in 1991)

## Why use Python in Data Science?

* Gentle learning curve and easy-to-understand syntax
* One of the largest collections of popular data science libraries
    * pandas, NumPy, RAPIDS, scikit-learn ... & so much more
* Thriving open-source community
    * You may find libraries on GitHub that address your problems!
* Graphics and visualization possibilities
    * Packages like matplotlib, bokeh, and plotly greatly help the visualization workflow
* Deploying models is relatively easy

## Other languages for Data Science

<div>
<img src="Data/Images/UsedProgrammingLanguages.png" width="600"/>
</div>

*Data from the 2018 Kaggle ML & DS Survey*

<div>
<img src="Data/Images/beforeAfter.jpg" width="500"/>
</div>

## Getting Started

* [Python 3](https://www.python.org/downloads/)
* Integrated Development Environment
    * [PyCharm](https://www.jetbrains.com/pycharm/), [VS Code](https://code.visualstudio.com/), [Atom](https://atom.io/) ... 
* Collection of Data Science libraries
    * pandas, numpy, matplotlib ...

## *Easier* Getting Started

* Utilize the [Anaconda Distribution](https://www.anaconda.com/distribution/)
    * Anaconda is an easy-to-set-up data science distribution with a library, dependency, and environment manager.
    * Includes Jupyter Notebook as a scratch pad for writing and running Python code.

**OR**
* Use Google Colab [https://colab.research.google.com/](https://colab.research.google.com/)
    * Free Jupyter Notebook environment that runs in the cloud.

___
## Basic Commands

Useful [Python "Cheat Sheet"](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf)!

### Printing out information
The famous "Hello World" program.

In [None]:
print("Hello World")

___
### Comments

Comments are snippets of text that are placed within the code and typically contain short descriptions about the code. 

* Comments in Python start with the '#' character and run to the end of the line

In [None]:
# Author: Darren R.
print("Hello World")

___
### Variables

Variables are symbolic names that refer to values stored in memory and are used throughout the program. 
 * Python is dynamically typed: a variable's type is determined at runtime from the value assigned to it. 

Most used data types: 
* Integers
* Floating Point Numbers
* Strings 
* Booleans

In [None]:
# Integer
age = 24

# Floating Point Number
height = 177.8

# String
name = "Darren Ramsook"
favColor = "Green"

# Boolean
license = True

In [None]:
print("My name is "+ str(name) + " and I am "+ str(age) + " years old.")
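
The same message can also be built with an f-string, and `type()` shows which type Python inferred for a value. The following is a minimal sketch that assumes Python 3.6 or later (required for f-strings); the variable values are copied from the cell above so that it runs on its own.

```python
# Values copied from the variables cell above so this runs on its own
name = "Darren Ramsook"
age = 24

# f-strings (Python 3.6+) embed expressions directly in the string,
# so no explicit str() conversion or "+" concatenation is needed
message = f"My name is {name} and I am {age} years old."
print(message)

# type() reports the type Python inferred for each value
print(type(age))
print(type(name))

# Because typing is dynamic, the same name can later hold another type
age = 24.5
print(type(age))
```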

___
### Control Statements

Control statements are instructions that determine which code runs and how often, allowing for case handling and looping.

* if ... else
* if ... elif ... else
* while ...
* for ... in ...

In [None]:
# if ... else

if license:  # a Boolean can be tested directly, without "== True"
    print("You can drive")
else:
    print("You cannot drive")

In [None]:
# if ... elif ... else

favColor = "Violet"

if favColor == "Green":
    print("Plants")
elif favColor == "Blue":
    print("Sky")
elif favColor == "Red":
    print("Rose")
else:
    print("Pick a new color")

In [None]:
# while ...
count = 0 
while count < 10:
    print(count)
    count = count + 1

In [None]:
# for ... in ...
for i in range(0,10):
    print(i)
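
A `for ... in ...` loop is not limited to `range()`; it can walk over any sequence directly. The sketch below uses a hypothetical list of colors (not part of the original cells) and the built-in `enumerate()` for cases where the position is needed as well.

```python
# Hypothetical sample list, for illustration only
colors = ["Green", "Blue", "Red"]

# Loop over the items themselves rather than over index numbers
for color in colors:
    print(color)

# enumerate() yields (index, item) pairs when the position is needed too
labelled = []
for index, color in enumerate(colors):
    labelled.append(f"{index}: {color}")

print(labelled)
```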

___
### Common Data Structures

* List: Collection which is ordered and mutable.
* Tuple: Collection which is ordered and immutable.
* Dictionary: Collection of key-value pairs that is mutable and indexed by key (insertion order is preserved since Python 3.7).

In [None]:
# List 
studentsAge = [10,11,12,13]

# List Indexing
print(studentsAge[0])
print(studentsAge[-1])
print(studentsAge[1:3])

studentsAge[0] = 25
print(studentsAge[0])

In [None]:
# Tuple

studentInfo = ("John Smith", 23, True)

print(studentInfo[0])

In [None]:
# Dictionary

rating = {}
rating["Python"] = 83
rating["SQL"] = 44
rating["R"] = 36
print(rating)
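
To make the difference between these structures concrete, here is a small sketch that reuses the values from the cells above. It shows that a tuple rejects item assignment and that dictionary values are looked up by key; the key "Julia" is a hypothetical example of a missing key.

```python
# Tuples are immutable: assigning to an element raises a TypeError
studentInfo = ("John Smith", 23, True)
try:
    studentInfo[1] = 24
except TypeError as error:
    tuple_error = str(error)
    print("Cannot modify a tuple:", tuple_error)

# Dictionaries are looked up by key rather than by position
rating = {"Python": 83, "SQL": 44, "R": 36}
print(rating["Python"])

# .get() returns a default instead of raising KeyError for a missing key
print(rating.get("Julia", 0))

# .items() yields (key, value) pairs for looping
for language, score in rating.items():
    print(language, score)
```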

___
## Extending Functionality: Using Popular Data Science Packages

* NumPy : package that excels at scientific computing
* pandas : easy-to-use data structures and data analysis tools
* Matplotlib : plotting library built on NumPy arrays

Once the packages are installed (they are included by default with the Anaconda Distribution), you can bring them into your workflow with the *import* statement.

### NumPy
NumPy provides high-performance multidimensional arrays and tools for working with them.

Useful Links:
[NumPy "Cheat Sheet"](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf) 

[Documentation](https://docs.scipy.org/doc/numpy/reference/)

Import NumPy under the alias "np" for convenience:

In [None]:
import numpy as np

___
First of all let's create two 2D arrays, A & B:

$A = \begin{bmatrix}1 & 3\\-4 & 3\end{bmatrix}$

$B = \begin{bmatrix}2 & 1\\3 & 1\end{bmatrix}$

To do this in NumPy we use the *np.array()* function.

In [None]:
A = np.array([[1,3],[-4,3]])
B = np.array([[2,1],[3,1]])

print(A)
print(B)

___
We can also create arrays that have initial placeholder values:

* *np.zeros()* : Creates an array of zeros
* *np.ones()* : Creates an array of ones

In [None]:
zerosMatrix = np.zeros((2,2))
onesMatrix = np.ones((2,2))

print(zerosMatrix)
print(onesMatrix)

___
Arrays can be examined through some built-in attributes:
* *array.shape* : Returns the array dimensions
* *array.size* : Returns the number of elements in array
* *array.dtype* : Returns the data type of array elements

In [None]:
print(A.shape)
print(A.size)
print(A.dtype)

___ 
Some arithmetic operations that are useful:

* *np.subtract(a,b)* : Subtraction
* *np.add(a,b)* : Addition
* *np.divide(a,b)* : Division
* *np.multiply(a,b)* : Multiplication

In [None]:
print(np.subtract(A,B))
print(np.add(A,B))
print(np.divide(A,B))
print(np.multiply(A,B))
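
Note that these functions work element by element. The sketch below, which redefines A and B so that it runs on its own, contrasts element-wise multiplication with the matrix product computed by the `@` operator.

```python
import numpy as np

A = np.array([[1, 3], [-4, 3]])
B = np.array([[2, 1], [3, 1]])

# The * operator is shorthand for np.multiply: each element of A is
# multiplied by the element of B in the same position
elementwise = A * B
print(elementwise)

# The @ operator (equivalent to np.matmul here) computes the matrix product:
# rows of A combined with columns of B
matrix_product = A @ B
print(matrix_product)
```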

___
Transpose and Inverse:

* *np.transpose()* : transpose
* *np.linalg.inv()* : inverse

In [None]:
print(np.transpose(A))
print(np.linalg.inv(A))
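
One way to check an inverse is to multiply it by the original matrix and compare the result with the identity matrix. This is a minimal sketch; it uses `np.allclose()` because floating-point rounding can leave tiny differences from exact zeros and ones.

```python
import numpy as np

A = np.array([[1, 3], [-4, 3]])
A_inv = np.linalg.inv(A)

# A matrix multiplied by its inverse should give the identity matrix
product = A @ A_inv
print(product)

# Compare with a tolerance instead of testing for exact equality
is_identity = np.allclose(product, np.eye(2))
print(is_identity)
```

Only square, non-singular matrices have an inverse; for a singular matrix, `np.linalg.inv()` raises `numpy.linalg.LinAlgError`.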

___
Subsets and individual values can be extracted from a matrix. Consider the 3×3 matrix C:

$C = \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}$

If we needed to extract the value at the first row, last column (0,2):

In [None]:
C = np.array([[1,2,3],[4,5,6],[7,8,9]])

print(C[0,2])

We can also "slice" a portion of an existing array to obtain a sub-array (the slice is a view that shares its data with the original array):

In [None]:
D = C[0:2,0:2]

print(D)
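
One detail worth knowing about slicing: a basic NumPy slice is a view onto the original data rather than an independent copy. The sketch below redefines C so that it runs on its own; the name E is introduced here purely for illustration.

```python
import numpy as np

C = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# A basic slice is a view: it shares memory with C
D = C[0:2, 0:2]
D[0, 0] = 100
print(C[0, 0])   # the change made through D is visible in C

# .copy() creates an independent array
E = C[0:2, 0:2].copy()
E[0, 0] = -1
print(C[0, 0])   # C is unaffected by changes to the copy
```

If the original array must stay untouched, calling `.copy()` on the slice is the safer choice.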

### pandas

Useful Links:
[Pandas Cheat Sheet](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3)

[Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)


The pandas library is built on top of NumPy and provides intuitive, easy-to-use data structures and data analysis tools.
___

Import pandas under the alias "pd" for convenience:

In [None]:
import pandas as pd

I/O Functionality:

Read files into pandas:
* *pd.read_csv()* : Read CSV file
* *pd.read_excel()* : Read Excel file

Write a pandas DataFrame to file:
* *dataframe.to_csv()* : Write to CSV file
* *dataframe.to_excel()* : Write to Excel file
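
The read and write functions mirror each other. As a rough sketch, the example below writes a small DataFrame to CSV text and reads it back, using an in-memory buffer instead of a file on disk. The rows are hypothetical sample data, not the contents of the tutorial's dataset.

```python
import io
import pandas as pd

# Hypothetical sample rows, for illustration only
df = pd.DataFrame({"Movie": ["A", "B", "C"], "Rating": [7.5, 8.0, 6.5]})

# With no path given, to_csv() returns the CSV text;
# index=False leaves out the row index column
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv() accepts a file path or any file-like object
df_again = pd.read_csv(io.StringIO(csv_text))
print(df_again)
```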

There is a "movieRatings.csv" file in the "Data" folder. Let's try to read it into pandas.

In [None]:
ratingsDF = pd.read_csv("Data/movieRatings.csv")

In [None]:
print(ratingsDF)

We can get more information about the dataset through the use of the *.info()* function:

In [None]:
ratingsDF.info()  # info() prints its summary itself and returns None

It seems the person entering the data left out the rating for ReviewID 1011. We can exclude this row by dropping it. To drop rows that contain missing values, we use the *.dropna()* method in pandas.

In [None]:
ratingsDF = ratingsDF.dropna()
print(ratingsDF)
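
To see what `.dropna()` does without the CSV file at hand, here is a small self-contained sketch. The DataFrame is hypothetical sample data in which one rating is missing (`np.nan`).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rating
df = pd.DataFrame({
    "Movie": ["A", "B", "C"],
    "Rating": [7.0, np.nan, 9.0],
})

# isna().sum() counts the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# dropna() returns a new DataFrame without the incomplete rows;
# the original df is left unchanged
cleaned = df.dropna()
print(cleaned)

# subset= restricts the check to the listed columns
cleaned_subset = df.dropna(subset=["Rating"])

# mean() skips missing values by default, so both results agree
print(df["Rating"].mean(), cleaned["Rating"].mean())
```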

Let's get the average rating of the 11 movies in this list. This can be done by calling the *.mean()* method on a specific column.

In [None]:
ratingMean = ratingsDF['Rating'].mean()
print(ratingMean)

### Matplotlib

Useful Links: [Matplotlib Cheat Sheet](https://python-graph-gallery.com/wp-content/uploads/Matplotlib_cheatsheet_datacamp.png)

Matplotlib is a 2D plotting library that generates publication-quality figures.

Below we show a simple plot of each movie's rating.

In [None]:
import matplotlib.pyplot as plt

plt.plot(ratingsDF['Movie'], ratingsDF['Rating'])
plt.xticks(rotation=90)
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.show()
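
Since the movies are separate categories rather than points along a continuous axis, a bar chart is a possible alternative to a line plot. The sketch below uses hypothetical ratings and Matplotlib's object-oriented interface. It selects the non-interactive "Agg" backend and renders to an in-memory buffer so that it runs without a display; in a notebook you would drop those lines and call `plt.show()` instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical ratings, for illustration only
movies = ["A", "B", "C"]
ratings = [7.5, 8.0, 6.5]

fig, ax = plt.subplots()
bars = ax.bar(movies, ratings)          # one bar per movie
ax.set_xlabel("Movie")
ax.set_ylabel("Rating")
ax.tick_params(axis="x", labelrotation=90)
fig.tight_layout()

# Render the figure to an in-memory PNG instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```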

___
## Endless Possibilities

Some good motivational examples: 

https://dash-gallery.plotly.host/dash-oil-and-gas/

https://dash-gallery.plotly.host/dash-object-detection

https://demo.bokeh.org/movies

In [None]:
print("Goodbye")