# 001_Getting Started With Python & Data Analysis

#### The Complete Data Science Pipeline

The data science pipeline is an end-to-end process in which each step contributes to producing the final insights. Every data science project begins by defining a clear problem to solve or a business or technical question to answer. Data is the core of data science; hence, scoping and collecting the right data for the project is crucial to achieving the required results.

Complete Data Science PIPELINE:
Data_Scoping --> Data_Review --> Feature_Engineering --> Feature_Review --> Model_Selection_and_Review --> Model_Evaluation_and_Insights --> Interaction/Production --> Feedback

After data collection, the next step in the pipeline involves wrangling, reviewing and transforming the data from a messy, raw form to a state that is easier to use. Although this can be time-consuming, it is essential to clean the data thoroughly, since machine learning models are only as good as the data provided: garbage in, garbage out. Conducting Exploratory Data Analysis (EDA) on the cleaned data using visualisations and statistical methods gives quick insight into the patterns and relationships between features in the dataset. Modelling involves applying statistical and machine learning methods, such as classification and clustering, to the processed data to create predictive models. Several evaluation methods are employed to compare the performance of these models and to improve them continuously before a final model is selected. Finally, all the work done in the pipeline is irrelevant if the results cannot be interpreted and communicated properly to the appropriate audience, so it is imperative to present the findings through visualisations and clear reporting. For the most part, the data science pipeline is not a linear process but an iterative one.

#### Libraries for Python Data Analysis

Pandas, NumPy, SciPy, Matplotlib and Seaborn are essential Python libraries used for data analysis.

Numerical computations on arrays and multidimensional matrices in data analysis are often done with the Numerical Python library, NUMPY.

PANDAS is a toolkit built on NumPy whose central data structure is the DATAFRAME; it is used on numerical and time-series data for quick and easy data manipulation, cleaning and analysis.

SCIPY can be described as a scientific package that uses NumPy arrays as its basic data structure.

MATPLOTLIB and SEABORN are plotting libraries capable of handling large datasets and producing both interactive and statistical graphics.

# 002_Numpy Array And Vectorization

#### Introduction to NumPy & Creating Arrays.

As mentioned previously, NumPy is a library whose basic data structure, the ndarray, is used to handle arrays and matrices. A NumPy array is a grid of values that all share the same data type, most commonly integers or floats. These arrays can also be created from Python lists.

Below are some examples:

In [1]:
# Convention for importing numpy

import numpy as np

In [2]:
arr = [6,7,8,9]
print(type(arr))

<class 'list'>


Lists are one of 4 built-in data types in Python used to store collections of data (multiple items in a single variable); the other 3 are Tuple, Set, and Dictionary, all with different qualities and usage.

Lists are created using square brackets [].

In [3]:
# To convert list to numpy array

a = np.array(arr)
print(a)
print(type(a))
print(f" {a.shape}, -a is a 1d array with {a.shape[0]} items" )
print(a.dtype)

#To get the dimension of a with ndim
print(a.ndim)

[6 7 8 9]
<class 'numpy.ndarray'>
 (4,), -a is a 1d array with 4 items
int32
1
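
Because every element of an ndarray shares one data type, NumPy converts mixed input to a common type. The sketch below illustrates this under simple assumptions; the variable names (`mixed`, `as_float`, `as_int`) are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# A list mixing integers and a float is upcast to a single float dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# The dtype can also be requested explicitly when the array is created
as_float = np.array([6, 7, 8, 9], dtype=np.float64)
print(as_float.dtype)

# astype() returns a converted copy; converting floats to integers truncates
as_int = mixed.astype(np.int64)
print(as_int)
```

The default integer dtype depends on the platform and NumPy version, which is why the output above reports int32 and other machines may report int64.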


In [4]:
b = np.array([[1,2,3], [4,5,6]])
print(b)
print(b.ndim)
print(f"{b.shape}, -b is a {b.ndim}d array with {b.shape[0]} rows and {b.shape[1]} columns")

[[1 2 3]
 [4 5 6]]
2
(2, 3), -b is a 2d array with 2 rows and 3 columns


There are also some built-in functions that can be used to initialize NumPy arrays, including empty(), zeros(), ones(), full(), identity() and random.random().

In [5]:
# To generate a 2x3 (2 rows, 3 columns) array with random values
np.random.random((2,3))

array([[0.51993007, 0.44842564, 0.15928139],
       [0.91412601, 0.14065568, 0.68196804]])

In [6]:
# To generate a 2x3 array of zeros
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [7]:
# To generate a 2x3 array of ones
np.ones((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [8]:
#To generate 3x3 identity matrix
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [9]:
#To generate a 2x3 array filled with 7
np.full((2,3), 7)

array([[7, 7, 7],
       [7, 7, 7]])

In [10]:
np.empty((2,2))

array([[1.02405275e-259, 2.60658458e+214],
       [4.57668639e-072, 5.94484328e-096]])

In [11]:
#The empty() function creates a new array of a given shape and type without initializing its entries,
#so the values shown are whatever happened to be in memory and will differ between runs.
#Examples:
print(np.empty(2))
print("###########################################")
print(np.empty([2,2]))
print("###########################################")
print(np.empty([2,3], dtype=int))
print("###########################################")
print(np.empty([2,3], dtype=float))

[ 1.29619797e+238 -2.02442286e-059]
###########################################
[[1.02405275e-259 2.60658458e+214]
 [4.57668639e-072 5.94484328e-096]]
###########################################
[[         0 1072693248          0]
 [1072693248          0 1072693248]]
###########################################
[[1. 1. 1.]
 [1. 1. 1.]]


In [12]:
# Reference table of Python's built-in functions (part of the language itself, not NumPy)
##FUNCTION           ##DESCRIPTION
#abs()	           Returns the absolute value of a number
#all()	           Returns True if all items in an iterable object are true
#any()	           Returns True if any item in an iterable object is true
#ascii()           Returns a readable version of an object, replacing non-ASCII characters with escape sequences
#bin()	           Returns the binary version of a number
#bool()	           Returns the boolean value of the specified object
#bytearray()	       Returns an array of bytes
#bytes()	           Returns a bytes object
#callable()	       Returns True if the specified object is callable, otherwise False
#chr()	           Returns a character from the specified Unicode code.
#classmethod()	   Converts a method into a class method
#compile()	       Returns the specified source as an object, ready to be executed
#complex()	       Returns a complex number
#delattr()	       Deletes the specified attribute (property or method) from the specified object
#dict()            Returns a dictionary
#dir()	           Returns a list of the specified object's properties and methods
#divmod()	       Returns the quotient and the remainder when argument1 is divided by argument2
#enumerate()	       Takes a collection (e.g. a tuple) and returns it as an enumerate object
#eval()	           Evaluates and executes an expression
#exec()	           Executes the specified code (or object)
#filter()	       Use a filter function to exclude items in an iterable object
#float()	           Returns a floating point number
#format()	       Formats a specified value
#frozenset()	       Returns a frozenset object
#getattr()	       Returns the value of the specified attribute (property or method)
#globals()	       Returns the current global symbol table as a dictionary
#hasattr()	       Returns True if the specified object has the specified attribute (property/method)
#hash()	           Returns the hash value of a specified object
#help()	           Executes the built-in help system
#hex()          	   Converts a number into a hexadecimal value
#id()	           Returns the id of an object
#input()           Reads a line of user input
#int()	           Returns an integer number
#isinstance()      Returns True if a specified object is an instance of a specified class
#issubclass()      Returns True if a specified class is a subclass of a specified class
#iter()	           Returns an iterator object
#len()	           Returns the length of an object
#list()	           Returns a list
#locals()	       Returns an updated dictionary of the current local symbol table
#map()	           Returns the specified iterator with the specified function applied to each item
#max()	           Returns the largest item in an iterable
#memoryview()	   Returns a memory view object
#min()	           Returns the smallest item in an iterable
#next()	           Returns the next item in an iterable
#object()	       Returns a new object
#oct()	           Converts a number into an octal
#open()	           Opens a file and returns a file object
#ord()             Returns the integer Unicode code point of the specified character
#pow()	           Returns the value of x to the power of y
#print()  	       Prints to the standard output device
#property()	       Gets, sets, deletes a property
#range()           Returns a sequence of numbers, starting from 0 and incrementing by 1 (by default)
#repr()	           Returns a readable version of an object
#reversed()	       Returns a reversed iterator
#round()           Rounds a number
#set()	           Returns a new set object
#setattr()	       Sets an attribute (property/method) of an object
#slice()	           Returns a slice object
#sorted()	       Returns a sorted list
#staticmethod()     Converts a method into a static method
#str()	           Returns a string object
#sum()	           Sums the items of an iterator
#super()	           Returns an object that represents the parent class
#tuple()	           Returns a tuple
#type()	           Returns the type of an object
#vars()	           Returns the __dict__ property of an object
#zip()	           Returns an iterator, from two or more iterators

#### Intra-operability of Arrays and Scalars.

Vectorization in NumPy arrays allows for faster processing by eliminating explicit for loops when operating on arrays of equal shape. Arithmetic operators are applied element-wise, so whole arrays can be processed in a single batch operation. Similarly, a scalar is propagated element-wise across an array. For arrays of different shapes, a direct element-wise operation is not possible; NumPy handles this by broadcasting, provided that, when the shapes are compared from the last dimension backwards, each pair of dimensions is either equal or one of them is 1.

In [13]:
c = np.array([[9.0,8.0,7.0],[1.0,2.0,3.0]])
d = np.array([[4.0,5.0,6.0],[7.0,8.0,9.0]])

print(f"Addition of c and d is: \n {c+d}")
print(f"Multiplication of c and d is: \n {c*d}")
print(f"To divide 5 by d, we have: \n {5/d}")
print(f"Raising c to the power of 2, we have: \n {c**2}")

Addition of c and d is: 
 [[13. 13. 13.]
 [ 8. 10. 12.]]
Multiplication of c and d is: 
 [[36. 40. 42.]
 [ 7. 16. 27.]]
To divide 5 by d, we have: 
 [[1.25       1.         0.83333333]
 [0.71428571 0.625      0.55555556]]
Raising c to the power of 2, we have: 
 [[81. 64. 49.]
 [ 1.  4.  9.]]
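
The cell above only combines arrays of identical shape. As a minimal sketch of broadcasting between arrays of different shapes, the example below uses hypothetical arrays named `matrix`, `row` and `column` (they are not part of the original notebook).

```python
import numpy as np

matrix = np.array([[9.0, 8.0, 7.0],
                   [1.0, 2.0, 3.0]])     # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])       # shape (3,)
column = np.array([[100.0], [200.0]])    # shape (2, 1)

# (2, 3) + (3,): the row is stretched across both rows of the matrix
row_result = matrix + row
print(row_result)

# (2, 3) + (2, 1): the column is stretched across all three columns
column_result = matrix + column
print(column_result)

# Shapes (2, 3) and (2,) are incompatible: the trailing dimensions 3 and 2 differ
try:
    matrix + np.array([1.0, 2.0])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print(f"Broadcasting failed: {err}")
```

In each successful case the smaller array is never physically copied; NumPy simply reuses its values along the dimension of size 1.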


#### Indexing With Arrays & Using Arrays for Data Processing

INDEXING: accessing individual elements of an array by their position

In [14]:
print(a[0])
print(a[3])
print(a[-1])
print(b[0,0])
print(b[1,2])
print(c[0,1])

6
9
9
1
6
8.0


SLICING: a feature that enables accessing PARTS of sequences such as strings, tuples, lists and arrays

In [15]:
print(d[1, 0:2])

e = np.array([[10,11,12],[13,14,15],[16,17,18],[19,20,21]])

#Slicing
e[:3, :2]

[7. 8.]


array([[10, 11],
       [13, 14],
       [16, 17]])
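
One detail worth knowing about slices such as the one above: a basic slice of a NumPy array is a view that shares memory with the original, not a copy. The sketch below demonstrates this with a hypothetical array named `grid`.

```python
import numpy as np

grid = np.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]])

# A basic slice is a view: writing to it also changes the original array
view = grid[:2, :2]
view[0, 0] = -1
print(grid[0, 0])

# copy() produces independent data, so the original is left alone
independent = grid[:2, :2].copy()
independent[0, 0] = 99
print(grid[0, 0])
```

Calling copy() is the safer choice whenever a slice will be modified and the original data must stay intact.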

In [16]:
#integer indexing
e[[2,0,3,1],[2,1,0,2]]

array([18, 11, 19, 15])

In [17]:
#boolean indexing meeting a specified condition
e[e>15]

array([16, 17, 18, 19, 20, 21])

In [18]:
e.ndim

2

NumPy also has built-in mathematical functions such as sum(), mean(), std(), corrcoef(), min() and others. It also allows arrays to be compared element-wise: == checks which elements of two arrays are equal, while > and < check which elements of the first array are greater than or less than those of the second.

In [19]:
sum(c)

array([10., 10., 10.])

In [20]:
np.sum(c)

30.0
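
The two results above differ because Python's built-in sum() iterates over the first axis and adds the rows together, whereas np.sum() reduces over every element by default. A minimal sketch of making the direction explicit with the axis argument (the result names are illustrative):

```python
import numpy as np

c = np.array([[9.0, 8.0, 7.0], [1.0, 2.0, 3.0]])

total = np.sum(c)                # every element
column_sums = np.sum(c, axis=0)  # collapse the rows: one value per column
row_sums = np.sum(c, axis=1)     # collapse the columns: one value per row

print(total)
print(column_sums)
print(row_sums)
```

The same axis argument is accepted by mean(), std(), min() and max().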

In [21]:
d.max()

9.0

In [22]:
np.max(d)

9.0

In [23]:
d.min()

4.0

In [24]:
np.min(d)

4.0

In [25]:
c.mean()

5.0

In [26]:
np.mean(c)

5.0

In [27]:
c.std()

3.1091263510296048

In [28]:
np.std(c)

3.1091263510296048

In [29]:
np.corrcoef(c,d)

array([[ 1., -1., -1., -1.],
       [-1.,  1.,  1.,  1.],
       [-1.,  1.,  1.,  1.],
       [-1.,  1.,  1.,  1.]])

In [30]:
np.corrcoef(c)

array([[ 1., -1.],
       [-1.,  1.]])

In [31]:
np.corrcoef(d)

array([[1., 1.],
       [1., 1.]])

In [32]:
c[0]==d[1]

array([False,  True, False])
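
Element-wise comparisons return an array of booleans, which often needs to be reduced to a single answer. The following sketch, using hypothetical arrays `first` and `second`, shows the other comparison operators together with a few ways of summarising the result.

```python
import numpy as np

first = np.array([9.0, 8.0, 7.0])
second = np.array([7.0, 8.0, 9.0])

equal_mask = first == second     # True where elements match
greater_mask = first > second    # True where first is larger
less_mask = first < second       # True where first is smaller
print(equal_mask, greater_mask, less_mask)

# Reduce the element-wise results to single booleans
same_arrays = np.array_equal(first, second)  # True only if every element matches
any_greater = greater_mask.any()             # at least one element is larger
all_at_least_7 = (first >= 7).all()          # every element satisfies the condition
print(same_arrays, any_greater, all_at_least_7)
```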

#### File Input and Output With Arrays

NumPy arrays can be saved to and loaded from binary files with the .npy extension using save() and load() respectively. The same can be done with text files using savetxt() and loadtxt().

In [33]:
# Windows paths should be written as raw strings (r'...') so that backslashes such as \U are not read as escape sequences
#np.save(r'C:\Users\LENOVO T460S\Documents\Hamoye AI\Data_Science_Track\STAGE1_Intro_to_Python_for_ML\LESSON1_PythonInML', np.array([[2,4,6],[8,10,12]]))
#np.load(r'C:\Users\LENOVO T460S\Documents\Hamoye AI\Data_Science_Track\STAGE1_Intro_to_Python_for_ML\LESSON1_PythonInML.npy')
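
Since the cell above depends on a machine-specific path, here is a portable sketch of the same save() and load() round trip. It assumes a temporary directory and a hypothetical file name, `lesson1_array.npy`, neither of which comes from the original notebook.

```python
import os
import tempfile

import numpy as np

data = np.array([[2, 4, 6], [8, 10, 12]])

# Write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lesson1_array.npy")  # hypothetical file name
    np.save(path, data)       # binary .npy format, preserves shape and dtype
    restored = np.load(path)  # read back fully into memory

print(restored)
print(np.array_equal(data, restored))
```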

In [34]:
# StringIO behaves like a file object
from io import StringIO

# Separate names are used so the arrays c and d defined earlier are not overwritten
text_file = StringIO("0 1 2 \n3 4 5")
loaded = np.loadtxt(text_file)

print(loaded)

[[0. 1. 2.]
 [3. 4. 5.]]


In [35]:
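new_array = np.array([24, 15, 27, 38, 15, 16, 25, 38, 485, 15, 16, 15])
# savetxt() writes the array to 'test9.txt' and returns None, which is why None is printed below
result = np.savetxt('test9.txt', new_array, delimiter=',', fmt='%d')
print(result)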

None


# 003_Pandas - So Much More Than A Cute Animal

#### Introducing Pandas data structures: Series, DataFrames and Index objects.

PANDAS is a library built on NumPy which is used for data manipulation and supports indexing by labels as well as by integers. Series, DataFrame and Index are the basic data structures in this library. A Series can be thought of as a one-dimensional labelled array whose elements share a single data type (integers, floats, strings or general Python objects), somewhat similar to a NumPy array; however, it can be indexed with descriptive labels as well as with integers.

In [36]:
# Convention for importing pandas

import pandas as pd

In [37]:
days = pd.Series(['Monday', 'Tuesday', 'Wednesday'])
print(days)

0       Monday
1      Tuesday
2    Wednesday
dtype: object


In [38]:
days_list = np.array(['Monday', 'Tuesday','Wednesday'])
print(days_list)

numpy_days = pd.Series(days_list)
print(numpy_days)

['Monday' 'Tuesday' 'Wednesday']
0       Monday
1      Tuesday
2    Wednesday
dtype: object


In [39]:
# Using strings as index

days = pd.Series(['Monday','Tuesday','Wednesday'],index=['a','b','c'])
print(days)

days1 = pd.Series({'a':'Monday','b':'Tuesday','c':'Wednesday'})
print(days1)

a       Monday
b      Tuesday
c    Wednesday
dtype: object
a       Monday
b      Tuesday
c    Wednesday
dtype: object


In [40]:
#Series can be accessed by position or by the specified index label as shown below

print(days.iloc[0])   # by integer position
print(days[1:])       # positional slice
print(days['c'])      # by label

Monday
b      Tuesday
c    Wednesday
dtype: object
Wednesday


In [41]:
days1['b']

'Tuesday'
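
A Series with a custom index can be read either by label or by position. As a minimal sketch, the explicit .loc and .iloc accessors remove any ambiguity between the two; the result names below are illustrative.

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday'], index=['a', 'b', 'c'])

first_by_position = days.iloc[0]  # position-based access
last_by_label = days.loc['c']     # label-based access
label_slice = days.loc['a':'b']   # label slices include the end label
position_slice = days.iloc[0:2]   # position slices exclude the end position

print(first_by_position, last_by_label)
print(label_slice)
print(position_slice)
```

Both slices return the same two items here, even though one is written with an inclusive end label and the other with an exclusive end position.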

A DataFrame can be described as a two-dimensional table made up of many Series that share the same index. It holds data in rows and columns just like a spreadsheet. Series, dictionaries, lists, other DataFrames, and NumPy arrays can all be used to create new ones.

In [42]:
print(pd.DataFrame())

Empty DataFrame
Columns: []
Index: []


In [43]:
# Creating a DataFrame from a dictionary

df_dict = {'Country':['Ghana','Kenya','Nigeria','Togo'], 'Capital':['Accra','Nairobi','Abuja','Lome'], 'Population':[10000, 8500, 35000, 12000], 'Age':[60,70,80,75]}
df = pd.DataFrame(df_dict, index=[2,4,6,8])
df

   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75


In [44]:
# Creating a DataFrame from a list

df_list = [['Ghana','Accra',10000,60],['Kenya','Nairobi',8500,70],['Nigeria','Abuja',35000,80],['Togo','Lome',12000,75]]
df1= pd.DataFrame(df_list,columns=['Country','Capital','Population','Age'],index=[2,4,6,8])
df1

   Country  Capital  Population  Age
2    Ghana    Accra       10000   60
4    Kenya  Nairobi        8500   70
6  Nigeria    Abuja       35000   80
8     Togo     Lome       12000   75
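
The cells above build DataFrames from a dictionary and from a list. As a sketch of the remaining sources mentioned earlier (Series, NumPy arrays and other DataFrames), the example below uses a two-row subset of the same data; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# From a dictionary of Series: the Series are aligned on their shared index
countries = pd.Series(['Ghana', 'Kenya'], index=[2, 4])
capitals = pd.Series(['Accra', 'Nairobi'], index=[2, 4])
df_from_series = pd.DataFrame({'Country': countries, 'Capital': capitals})

# From a 2-D NumPy array: column names and index are supplied separately
values = np.array([[10000, 60], [8500, 70]])
df_from_array = pd.DataFrame(values, columns=['Population', 'Age'], index=[2, 4])

# From other DataFrames: join them side by side on the common index
combined = pd.concat([df_from_series, df_from_array], axis=1)
print(combined)
```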


at, iat, iloc and loc are accessors used to retrieve data from DataFrames. iloc selects rows and columns by integer position, while loc selects them by label. at and iat retrieve single values: at uses the row and column labels, and iat uses integer positions.

In [45]:
# To select the row at integer position 3 (the fourth row)

df.iloc[3] 

Country        Togo
Capital        Lome
Population    12000
Age              75
Name: 8, dtype: object

In [46]:
# To select the row with index LABELED 8

df.loc[8]

Country        Togo
Capital        Lome
Population    12000
Age              75
Name: 8, dtype: object

In [47]:
# To select the Capital column
df['Capital']

2      Accra
4    Nairobi
6      Abuja
8       Lome
Name: Capital, dtype: object

In [48]:
# To select the country at index LABELED 6
df.at[6,'Country']

'Nigeria'

In [49]:
# To select the country at row position 2, column position 0
df.iat[2,0]

'Nigeria'
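
The accessors above are shown retrieving one row or one value at a time. A short sketch of selecting several rows and columns at once follows; it rebuilds the same `df` so that it runs on its own, and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Country': ['Ghana', 'Kenya', 'Nigeria', 'Togo'],
     'Capital': ['Accra', 'Nairobi', 'Abuja', 'Lome'],
     'Population': [10000, 8500, 35000, 12000],
     'Age': [60, 70, 80, 75]},
    index=[2, 4, 6, 8])

# loc: labels for both rows and columns
by_label = df.loc[[2, 6], ['Country', 'Population']]

# iloc: integer positions; the end of each slice is excluded
by_position = df.iloc[0:2, 1:3]

# loc also accepts a boolean condition to filter rows
large = df.loc[df['Population'] > 10000, 'Country']

print(by_label)
print(by_position)
print(large)
```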

Similar to NumPy, Pandas has functions that provide descriptive statistics such as measures of central tendency, dispersion, skewness and kurtosis, correlation and multicollinearity. Some of these functions are mode(), median(), mean(), sum(), std(), var(), skew(), kurt() and min(). The describe() function summarises the numeric columns in a DataFrame, displaying the count, mean, standard deviation, minimum, quartiles (25%, 50% and 75%) and maximum values.

In [50]:
df['Population'].sum()

65500

In [51]:
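# numeric_only=True restricts the mean to numeric columns; recent pandas versions raise an error on text columns otherwise
df.mean(numeric_only=True)
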
Population    16375.00
Age              71.25
dtype: float64

In [52]:
df.describe()

         Population       Age
count           4.0       4.0
mean        16375.0     71.25
std    12499.166639  8.539126
min          8500.0      60.0
25%          9625.0      67.5
50%         11000.0      72.5
75%         17750.0     76.25
max         35000.0      80.0
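
Beyond describe(), the individual statistics listed earlier can be computed directly. This is a minimal sketch on the numeric columns of the same data; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [10000, 8500, 35000, 12000], 'Age': [60, 70, 80, 75]},
    index=[2, 4, 6, 8])

median_population = df['Population'].median()
age_variance = df['Age'].var()   # sample variance (ddof=1)
age_skew = df['Age'].skew()      # asymmetry of the distribution
correlation = df.corr()          # pairwise Pearson correlation matrix

# Interquartile range from the 25% and 75% quantiles reported by describe()
iqr = df['Population'].quantile(0.75) - df['Population'].quantile(0.25)

print(median_population, age_variance, age_skew, iqr)
print(correlation)
```

Note that pandas std() and var() use the sample formula (ddof=1) by default, whereas np.std() and np.var() default to the population formula (ddof=0), so the two libraries can report different values for the same data.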


#### The missing data enigma: Importance, types and handling missing data.

Often, data used for analysis in real-life scenarios is incomplete as a result of omission, faulty devices, and many other factors. Pandas represents missing values as NA or NaN, which can be detected, removed, or filled with functions like isnull(), notnull(), dropna(), fillna() and replace().



In [53]:
df_dict2 = {'Name':['James','Yemen','Caro',np.nan],'Profession':['Researher','Artist','Doctor','Writer'],'Experience':[12,np.nan,10,8],'Height':[np.nan,175,180,150]}
new_df = pd.DataFrame(df_dict2)
new_df

    Name  Profession  Experience  Height
0  James   Researher        12.0     NaN
1  Yemen      Artist         NaN   175.0
2   Caro      Doctor        10.0   180.0
3    NaN      Writer         8.0   150.0


In [54]:
# Flag cells with missing values as True
new_df.isnull()

    Name  Profession  Experience  Height
0  False       False       False    True
1  False       False        True   False
2  False       False       False   False
3   True       False       False   False


In [55]:
# Remove rows with missing values (returns a new DataFrame; new_df itself is unchanged)
new_df.dropna()

   Name  Profession  Experience  Height
2  Caro      Doctor        10.0   180.0


In [56]:
new_df

    Name  Profession  Experience  Height
0  James   Researher        12.0     NaN
1  Yemen      Artist         NaN   175.0
2   Caro      Doctor        10.0   180.0
3    NaN      Writer         8.0   150.0


In [57]:
new_df.fillna(4)

    Name  Profession  Experience  Height
0  James   Researher        12.0     4.0
1  Yemen      Artist         4.0   175.0
2   Caro      Doctor        10.0   180.0
3      4      Writer         8.0   150.0


In [58]:
new_df['Name'].replace('James', 'John')

0     John
1    Yemen
2     Caro
3      NaN
Name: Name, dtype: object

In [59]:
new_df.notnull()

    Name  Profession  Experience  Height
0   True        True        True   False
1   True        True       False    True
2   True        True        True    True
3  False        True        True    True


In [60]:
new_df

    Name  Profession  Experience  Height
0  James   Researher        12.0     NaN
1  Yemen      Artist         NaN   175.0
2   Caro      Doctor        10.0   180.0
3    NaN      Writer         8.0   150.0


In [61]:
new_df.replace(np.nan, 200)

    Name  Profession  Experience  Height
0  James   Researher        12.0   200.0
1  Yemen      Artist       200.0   175.0
2   Caro      Doctor        10.0   180.0
3    200      Writer         8.0   150.0
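
Filling every gap with one constant, as above, mixes numbers into the Name column. A hedged sketch of a more selective approach follows: it counts the gaps first and then fills each column with a value suited to its type. The 'Unknown' placeholder and the choice of mean and median are assumptions made for illustration, not recommendations from the original notebook.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({
    'Name': ['James', 'Yemen', 'Caro', np.nan],
    'Profession': ['Researcher', 'Artist', 'Doctor', 'Writer'],
    'Experience': [12, np.nan, 10, 8],
    'Height': [np.nan, 175, 180, 150]})

# Count the missing values in each column
missing_per_column = new_df.isnull().sum()
print(missing_per_column)

# Fill each column with a value that suits its type (illustrative choices)
filled = new_df.fillna({
    'Name': 'Unknown',                          # placeholder label (assumption)
    'Experience': new_df['Experience'].mean(),  # mean of the observed values
    'Height': new_df['Height'].median()})       # median of the observed values
print(filled)

# Drop a row only when a specific column is missing
named_only = new_df.dropna(subset=['Name'])
print(named_only)
```

Like the calls shown earlier, fillna() and dropna() return new DataFrames here, so new_df itself keeps its missing values.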
