## Introduction to Python Libraries (NumPy and Pandas)

### What are libraries?

In everyday use, a library is a collection of books, or a room or building where many books are stored to be used later. Similarly, in the programming world, a library is a collection of prewritten (often precompiled) code that a program can reuse for specific, well-defined operations. Besides code, a library may contain documentation, configuration data, message templates, classes, and values.

A Python library is a collection of related modules. It bundles code that can be used repeatedly in different programs, which makes Python programming simpler and more convenient because we don't need to write the same code again and again. Python libraries play a vital role in fields such as machine learning, data science, and data visualization.

Let us take a look at two libraries that will be important on our data analysis journey: NumPy and Pandas.

### NumPy

NumPy, short for "Numerical Python", is used for working with arrays.

In Python, lists can serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray, and it comes with many supporting functions that make it easy to work with. Arrays are used very frequently in data science, where speed and resources are very important.



### Understanding NumPy

Import NumPy in your applications by using the import keyword:



In [2]:
import numpy

Now NumPy is imported and ready to use. NumPy is usually imported under the np alias, as shown below:

In [4]:
import numpy as np

As noted earlier, the array object in NumPy is called ndarray. We can create an ndarray object by using the array() function.

In [5]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

[1 2 3 4 5]
<class 'numpy.ndarray'>


### Dimensions in Arrays

0-D Arrays

0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.


In [9]:
### Example -- Create a 0-D array with value 42


import numpy as np

arr = np.array(42)

print(arr)

# You can also check the dimension
print(arr.ndim)

42
0


1-D Arrays

An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array. These are the most common and basic arrays.

In [10]:
### Example -- Create a 1-D array containing the values 1,2,3,4,5:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

# You can also check the dimension
print(arr.ndim)

[1 2 3 4 5]
1
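
The walkthrough moves from 1-D straight to 3-D arrays, so here is a minimal sketch of the 2-D case in between: an array whose elements are 1-D arrays, often used to represent a matrix. The variable name `arr2d` and the values are illustrative and not from the original examples.

```python
import numpy as np

# A 2-D array: each element of the outer list is itself a 1-D array
# (values chosen only for illustration)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2d)

# ndim reports the number of dimensions, which is 2 here
print(arr2d.ndim)
```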


3-D arrays


An array that has 2-D arrays (matrices) as its elements is called a 3-D array. These are often used to represent a 3rd order tensor.

In [12]:
### Example -- Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr)

# You can also check the dimension
print(arr.ndim)

[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
3


### NumPy Array Indexing


Array indexing means accessing an array element. You can access an array element by referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first element has index 0, the second has index 1, and so on.

In [13]:
### Example -- Get the first element from the following array:

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[0])

1


In [14]:
### Example -- Get third and fourth elements from the following array and add them.

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[2] + arr[3])

7


In [15]:
### Example -- Access the element on the first row, second column:

import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('2nd element on 1st row: ', arr[0, 1])

2nd element on 1st row:  2


In [16]:
### Example -- Access the third element of the second array of the first array:

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr[0, 1, 2])

6


In [17]:
### arr[0, 1, 2] prints the value 6.

# And this is why:

# The first number represents the first dimension, which contains two arrays:
# [[1, 2, 3], [4, 5, 6]]
# and:
# [[7, 8, 9], [10, 11, 12]]
# Since we selected 0, we are left with the first array:
# [[1, 2, 3], [4, 5, 6]]

# The second number represents the second dimension, which also contains two arrays:
# [1, 2, 3]
# and:
# [4, 5, 6]
# Since we selected 1, we are left with the second array:
# [4, 5, 6]

# The third number represents the third dimension, which contains three values:
# 4
# 5
# 6
# Since we selected 2, we end up with the third value:
# 6

### NumPy Array Slicing

Slicing in Python means taking elements from one given index to another given index. We pass a slice instead of an index, like this: [start:end]. We can also define the step, like this: [start:end:step].

If we don't pass start, it is considered 0.

If we don't pass end, it is considered the length of the array in that dimension.

If we don't pass step, it is considered 1.

Note: The result includes the start index but excludes the end index.
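
The defaults described above can be checked directly. This is a small sketch, using an illustrative array, that compares each shortened slice with its fully written-out form.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Omitting start is the same as starting at index 0
print(arr[:4], arr[0:4])

# Omitting end is the same as using the length of the array
print(arr[4:], arr[4:len(arr)])

# Omitting step is the same as a step of 1
print(arr[1:5], arr[1:5:1])

# The end index is excluded: index 5 (the value 6) is not in arr[1:5]
print(arr[1:5])
```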

In [19]:
### Example -- Slice elements from index 1 to index 5 (not included) from the following array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])

[2 3 4 5]


In [20]:
### Example -- Slice elements from index 4 to the end of the array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[4:])

[5 6 7]


In [21]:
### Example -- Return every other element from index 1 to index 5 (not included):

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5:2])

[2 4]


In [22]:
### Example -- From the second row, slice elements from index 1 to index 4 (not included):

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[1, 1:4])

[7 8 9]


In [23]:
### Example -- From both rows, return the element at index 2:

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 2])

[3 8]


### Shape of an Array

The shape of an array is the number of elements in each dimension. NumPy arrays have an attribute called shape that returns a tuple in which each entry gives the number of elements along the corresponding dimension.

In [24]:
### Example -- Print the shape of a 2-D array:

import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)


(2, 4)


The example above returns (2, 4), which means that the array has 2 dimensions, where the first dimension has 2 elements and the second has 4.

There are a number of other attributes you might want to take a look at and experiment with.
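
As a starting point for that experimentation, here is a brief sketch of four commonly used ndarray attributes. The array is the same illustrative 2-D array used in the shape example; the printed dtype depends on your platform, so no specific value is claimed for it.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.ndim)   # number of dimensions: 2
print(arr.shape)  # elements per dimension: (2, 4)
print(arr.size)   # total number of elements: 8
print(arr.dtype)  # element type (an integer type; exact name is platform dependent)
```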

### Try Out

Try the following operations on any 1-dimensional, 2-dimensional, and 3-dimensional arrays that you create:

1. Array Reshape
2. Array Join
3. Array Search
4. Array Sort
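
If you are unsure where to begin, the four operations above correspond to NumPy calls such as reshape(), concatenate(), where(), and sort(). The sketch below applies them to made-up 1-D arrays only; the arrays and variable names are illustrative, and trying the same calls on 2-D and 3-D arrays is left as the exercise.

```python
import numpy as np

a = np.array([3, 1, 2, 6, 5, 4])
b = np.array([9, 8, 7])

# Reshape: view the 6 elements as 2 rows of 3
reshaped = a.reshape(2, 3)

# Join: concatenate two arrays end to end
joined = np.concatenate((a, b))

# Search: where() returns a tuple of index arrays for matching elements
positions = np.where(a > 4)

# Sort: np.sort returns a sorted copy and leaves a unchanged
sorted_a = np.sort(a)

print(reshaped)
print(joined)
print(positions)
print(sorted_a)
```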

In [25]:
# You can type your code here





















### Basic Operations

1. Operations on a single array: We can use overloaded arithmetic operators to do element-wise operations on an array to create a new array. In the case of the +=, -=, and *= operators, the existing array is modified.


In [27]:
# Python program to demonstrate
# basic operations on single array
import numpy as np

a = np.array([1, 2, 5, 3])

# add 1 to every element
print("Adding 1 to every element:", a + 1)

# subtract 3 from each element
print("Subtracting 3 from each element:", a - 3)

# multiply each element by 10
print("Multiplying each element by 10:", a * 10)

# square each element
print("Squaring each element:", a ** 2)

# modify existing array in place
a *= 2
print("Doubled each element of original array:", a)

Adding 1 to every element: [2 3 6 4]
Subtracting 3 from each element: [-2 -1  2  0]
Multiplying each element by 10: [10 20 50 30]
Squaring each element: [ 1  4 25  9]
Doubled each element of original array: [ 2  4 10  6]


2. Unary operators: Many unary operations are provided as methods of the ndarray class. These include sum, min, max, etc. These functions can also be applied row-wise or column-wise by setting the axis parameter.

In [28]:
# Python program to demonstrate
# unary operators in numpy
import numpy as np

arr = np.array([[1, 5, 6], [4, 7, 2], [3, 1, 9]])

# maximum element of array
print("Largest element is:", arr.max())
print("Row-wise maximum elements:", arr.max(axis=1))

# sum of array elements
print("Sum of all array elements:", arr.sum())

Largest element is: 9
Row-wise maximum elements: [6 7 9]
Sum of all array elements: 38
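
The example above only uses axis=1 (one result per row). A short sketch of the column-wise case, on the same array, may help show how the axis parameter changes the result; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 5, 6], [4, 7, 2], [3, 1, 9]])

# axis=0 collapses the rows, giving one result per column
col_max = arr.max(axis=0)
col_min = arr.min(axis=0)

# axis=1 collapses the columns, giving one result per row
row_sum = arr.sum(axis=1)

print("Column-wise maximum elements:", col_max)  # [4 7 9]
print("Column-wise minimum elements:", col_min)  # [1 1 2]
print("Row-wise sums:", row_sum)                 # [12 13 13]
```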


### Pandas

Pandas is a Python package that offers various data structures and operations for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for users.

It is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Pandas gives you answers about the data, such as:

1. Is there a correlation between two or more columns?
2. What is the average value?
3. What is the maximum value?
4. What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.

### Understanding Pandas

Import pandas in your applications by using the import keyword:

In [29]:
import pandas

As with NumPy, pandas is usually imported under an alias, in this case pd.

In [30]:
import pandas as pd

In [31]:
# Let us see a quick example

import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Pandas Series

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

In [32]:
### Example -- Create a simple Pandas Series from a list:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


### Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [33]:
### Example -- Return the first value of the Series:

print(myvar[0])

1
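
The phrase "if nothing else is specified" implies that you can supply your own labels. A minimal sketch, assuming the same list as above and using the index argument of pd.Series; the labels "x", "y", and "z" and the name `labeled` are made up for illustration.

```python
import pandas as pd

a = [1, 7, 2]

# Provide custom labels instead of the default 0, 1, 2
labeled = pd.Series(a, index=["x", "y", "z"])

print(labeled)

# Access a value by its label
print(labeled["y"])  # 7
```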


### DataFrames
Data sets in Pandas are usually two-dimensional tables, called DataFrames. If a Series is like a column, a DataFrame is the whole table.

You will work with DataFrames a lot as you start to analyse datasets in table format.

In [42]:
### Example -- Create a DataFrame from a dictionary of two columns:

import pandas as pd

data = {
  "miles": [420, 380, 390],
  "speed": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)

   miles  speed
0    420     50
1    380     40
2    390     45
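
The example above builds the DataFrame from a dictionary of plain lists. To make the "a Series is like a column" idea concrete, here is a sketch that builds the same table from two actual Series objects. The names `miles`, `speed`, and `trips` are illustrative.

```python
import pandas as pd

# Two Series, each of which will become one column
miles = pd.Series([420, 380, 390])
speed = pd.Series([50, 40, 45])

# The dictionary keys become the column names
trips = pd.DataFrame({"miles": miles, "speed": speed})

print(trips)

# Selecting one column gives back a Series
print(type(trips["miles"]))
```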


### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [43]:
### Example -- Return row 0:

#refer to the row index:
print(myvar.loc[0])

miles    420
speed     50
Name: 0, dtype: int64


### Try Out

What data type does this example return?

In [44]:
# Type your answer here as a comment




In [45]:
### Example -- Return row 0 and 1:

#use a list of indexes:
print(myvar.loc[[0, 1]])

   miles  speed
0    420     50
1    380     40


### Named Indexes
With the index argument, you can name your own indexes.

In [41]:
### Example -- Add a list of names to give each row a name:

import pandas as pd

data = {
  "miles": [420, 380, 390],
  "speed": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)


      miles  speed
day1    420     50
day2    380     40
day3    390     45


You can also use a named index in the loc attribute to return the specified row(s).

In [46]:
### Example -- Return "day2":

#refer to the named index:
print(df.loc["day2"])

miles    380
speed     40
Name: day2, dtype: int64
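
Named indexes work with the other forms of loc as well. The sketch below, which rebuilds the same small table under the illustrative name `named_df`, selects several named rows at once and then a single cell by row label and column name.

```python
import pandas as pd

data = {
  "miles": [420, 380, 390],
  "speed": [50, 40, 45]
}

named_df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns those rows
subset = named_df.loc[["day1", "day3"]]
print(subset)

# A row label plus a column name returns a single value
value = named_df.loc["day2", "miles"]
print(value)  # 380
```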


### Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.

You can get the details of the Titanic data to be used [here](https://www.kaggle.com/competitions/titanic/data?select=train.csv)

In [72]:
### Example -- Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd

dfLoad = pd.read_csv('titanicTrainData.csv')

print(dfLoad) 

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

We can also read JSON files with pandas. Big data sets are often stored or extracted as JSON.

JSON is plain text in an object-like format. It is widely used in the programming world, and pandas can read it directly.

In [52]:
import pandas as pd

df = pd.read_json('dataset.json')

print(df)
      

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
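
If you do not have the CSV or JSON files at hand, you can still try both readers on in-memory text. The sketch below is an assumption-laden stand-in: the two tiny data sets are made up for illustration, and io.StringIO is used so that the text behaves like a file.

```python
import io
import pandas as pd

# Made-up sample data, not taken from the Titanic or dataset.json files
csv_text = "name,age\nAda,36\nLin,29\n"
json_text = '{"name": {"0": "Ada", "1": "Lin"}, "age": {"0": 36, "1": 29}}'

# StringIO wraps a string so the readers can treat it like a file
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)
```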


### Pandas - Analyzing DataFrames

###  Viewing the Data
One of the most used methods for getting a quick overview of a DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.


In [53]:
### Example -- Get a quick overview by printing the first 10 rows of the DataFrame:
    
import pandas as pd

df = pd.read_csv('titanicTrainData.csv')

print(df.head(10)) 

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54

There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [54]:
import pandas as pd

df = pd.read_csv('titanicTrainData.csv')

print(df.tail(10)) 

     PassengerId  Survived  Pclass                                      Name  \
881          882         0       3                        Markun, Mr. Johann   
882          883         0       3              Dahlberg, Miss. Gerda Ulrika   
883          884         0       2             Banfield, Mr. Frederick James   
884          885         0       3                    Sutehall, Mr. Henry Jr   
885          886         0       3      Rice, Mrs. William (Margaret Norton)   
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  
881    male  33.0      0      0            

### Info About the Data
The DataFrame object has a method called info() that gives you more information about the data set.

In [55]:
### Example -- Print information about the data:

print(df.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None


We can see that there are 12 columns in this dataset. The Non-Null Count column varies because some columns (Age, Cabin, and Embarked) have missing values, and the Dtype column shows the data type of each column. The trailing None appears because info() prints its report directly and returns None, which print() then displays.
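
Another way to see the missing values is to count them per column. This sketch assumes a small made-up DataFrame named `toy` (so it runs without the Titanic file) and uses isna() followed by sum().

```python
import numpy as np
import pandas as pd

# Made-up data that mimics a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing = toy.isna().sum()

print(missing)
```

The same call on the Titanic DataFrame would report the gaps that the Non-Null Count column hints at.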

### Let us dive into the data some more with pandas as we clean the data up

### Data Cleaning
Data cleaning means fixing bad data in your data set. When working with multiple data sources, there are many chances for data to be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and algorithms are unreliable, even though they may look correct. Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset.

There is no single way to prescribe the precise steps of data cleaning, because the process varies from dataset to dataset. Data cleaning, also called data cleansing or data scrubbing, is the first step in the overall data preparation process. It plays an important part in producing reliable answers in the analytical process and is considered a basic part of data science. The aim of data cleaning is to produce uniform, standardized data sets that give data analysis tools and business intelligence easy access to accurate data for each problem.



Bad data could be:

* Empty cells
* Data in wrong format
* Wrong data
* Duplicates

### Handling Empty Cells

### Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big and removing a few rows will not have a big impact on the result. In our Titanic data, however, it might be more harmful to our analysis.

In [63]:
### Example -- Return a new DataFrame with no empty cells:

import pandas as pd

df = pd.read_csv('titanicTrainData.csv')

new_df = df.dropna()

print(new_df)

     PassengerId  Survived  Pclass  \
1              2         1       1   
3              4         1       1   
6              7         0       1   
10            11         1       3   
11            12         1       1   
..           ...       ...     ...   
871          872         1       1   
872          873         0       1   
879          880         1       1   
887          888         1       1   
889          890         1       1   

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
6                              McCarthy, Mr. Timothy J    male  54.0      0   
10                     Sandstrom, Miss. Marguerite Rut  female   4.0      1   
11                            Bonnell, Miss. Elizabeth  female  58.0      0   
..                                                 ...     ...   ... 

In [64]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  183 non-null    int64  
 1   Survived     183 non-null    int64  
 2   Pclass       183 non-null    int64  
 3   Name         183 non-null    object 
 4   Sex          183 non-null    object 
 5   Age          183 non-null    float64
 6   SibSp        183 non-null    int64  
 7   Parch        183 non-null    int64  
 8   Ticket       183 non-null    object 
 9   Fare         183 non-null    float64
 10  Cabin        183 non-null    object 
 11  Embarked     183 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB
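
dropna() does not have to consider every column. As a hedged sketch on made-up data (the DataFrame `toy` and its values are illustrative), the subset argument restricts the check to the columns you name, which keeps more rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", None, None],
})

# Drop a row if ANY column is empty: only the first row survives
all_complete = toy.dropna()

# Drop a row only if Age is empty: rows 0 and 2 survive
age_known = toy.dropna(subset=["Age"])

print(len(toy), len(all_complete), len(age_known))
```

Both calls return new DataFrames and leave `toy` itself unchanged.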


### Try Out

Based on the info above, would you suggest removing all the rows with empty values? Whether your answer is yes or no, state your reasons.

In [65]:
# You can type your answer here









### Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead. This way you do not have to delete entire rows just because of some empty cells.

The fillna() method allows us to replace empty cells with a value. By default it returns a new object and leaves the original DataFrame unchanged:

In [67]:
### Example -- Replace NULL values with the number 1:

import pandas as pd

df = pd.read_csv('titanicTrainData.csv')

df.fillna(1)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 1 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 1 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | 1 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | 1 | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 1.0 | 1 | 2 | W./C. 6607 | 23.4500 | 1 | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


### Replace Only For Specified Columns
The example above replaces all empty cells in the whole DataFrame.

To only replace empty values for one column, specify the column name for the DataFrame:

In [68]:
### Example -- Replace NULL values in the "Cabin" column with the number 1:

import pandas as pd

df = pd.read_csv('titanicTrainData.csv')

df["Cabin"].fillna(1)

0         1
1       C85
2         1
3      C123
4         1
       ... 
886       1
887     B42
888       1
889    C148
890       1
Name: Cabin, Length: 891, dtype: object
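
Note that the fillna() calls above only display a result; they do not change df. A minimal sketch on made-up data (the DataFrame `toy` and the placeholder "Unknown" are illustrative choices) shows the difference between discarding the result and assigning it back.

```python
import pandas as pd

toy = pd.DataFrame({"Cabin": ["C85", None, "B42"]})

# The result is computed but not stored, so toy is unchanged
toy["Cabin"].fillna("Unknown")
still_missing = toy["Cabin"].isna().sum()

# Assigning the result back to the column makes the change stick
toy["Cabin"] = toy["Cabin"].fillna("Unknown")
now_missing = toy["Cabin"].isna().sum()

print(still_missing, now_missing)
print(toy)
```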

### Replace Using Mean, Median, or Mode

A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column:

In [70]:
### Example -- Calculate the MEAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('titanicTrainData.csv')

x = df["Age"].mean()

df["Age"].fillna(x)

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
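
The median and mode work in a similar way, with one difference worth knowing: mode() returns a Series (there can be ties), so you pick an entry such as the first one with [0]. This is a sketch on made-up data; the DataFrame `toy` and its values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# median() ignores missing values: the median of 22, 26, 40 is 26
age_median = toy["Age"].median()

# mode() returns a Series of the most frequent value(s); take the first
port_mode = toy["Embarked"].mode()[0]

# Assign the filled columns back so the change is kept
toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(port_mode)

print(toy)
```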

### If you run the mean example above on the Cabin column, do you think it would work?

### Data Correlations

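We can use pandas to find the relationships between columns and how strongly they are correlated. The corr() method calculates the correlation between each pair of numeric columns in your data set.

In [73]:
### Example -- Show the relationship between the columns:
### We are using the initially loaded data
### numeric_only=True (available in pandas 1.5 and later) skips text columns such as Name,
### which pandas 2.0 and later would otherwise reject with an error
dfLoad.corr(numeric_only=True)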

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |


You will notice that the diagonal values are 1.0, meaning that each column is perfectly correlated with itself, which is logical.
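
To build some intuition for the range of values corr() can produce, here is a sketch on made-up columns with a deliberately exact relationship. The DataFrame `toy` and its column names are illustrative and are not part of the Titanic data.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],      # rises exactly in step with x
    "neg_x": [-1, -2, -3, -4],     # falls exactly as x rises
})

corr = toy.corr()

print(corr)

# x vs double_x is (up to rounding) 1.0; x vs neg_x is (up to rounding) -1.0
print(corr.loc["x", "double_x"], corr.loc["x", "neg_x"])
```

Real data such as the Titanic columns will usually fall somewhere between these two extremes.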

### Try Out
What do you think each of these values means?