# Introduction to Python

This brief introduction to the Python programming language aims to explain, in a concise manner, the Python concepts required in Data Science.

## Table of Contents


1. Types of Numbers in Python

2. Common Math Operations

3. Comments in Python

4. Variables

5. Case-sensitivity

6. Data types

7. Built-in Data Structures

8. Built-in Functions

9. Conditions

10. Loops

11. Custom Functions

#### String


Strings were introduced earlier, in the section on Comments. Python also provides operations called "methods" that help us manipulate strings the way we want.

There are many string methods. The most important ones, which we shall look at here, are:

**slicing<br>
.strip()<br>
.lstrip()<br>
.rstrip()<br>
.strip() with a character<br>
.replace()<br>
.split(separator, maxsplit)<br>
.rsplit(separator, maxsplit)<br>
.join()<br>
.upper(), .lower(), .capitalize()<br>
.islower(), .isupper()<br>
.isalpha(), .isnumeric(), .isalnum()<br>
.count()<br>
.find()<br>
.rfind()<br>
.startswith()<br>
.endswith()<br>
.partition(separator)<br>
f-strings<br>
.swapcase()<br>
len()<br>**

To use a method, we always place a dot "." between the string and the name of the method.

For example, take the string "I am a boy".

To convert all the letters in this string to uppercase, we use the method **.upper()**. This is shown below:

In [None]:
"I am a boy".upper()

You might have noticed that the method .upper() has a pair of brackets (parentheses) after it. The brackets are what actually call the method, that is, they tell Python to run it. They can also hold extra inputs for the method, which will be discussed later on in this notebook.

For now, just know that brackets accompany a method whenever we want it to run. Some things accessed with a dot are plain values (attributes) rather than methods, and those are written without brackets.
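The role of the brackets can be seen in a small sketch. The variable names `text`, `called`, and `not_called` are illustrative only and are not used elsewhere in this notebook.

```python
text = "I am a boy"

# With brackets, the method runs and gives back a new string.
called = text.upper()

# Without brackets, we only get a reference to the method; nothing is run yet.
not_called = text.upper

print(called)                # I AM A BOY
print(callable(not_called))  # True
print(not_called())          # I AM A BOY (the brackets finally call it)
```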

To explore these string methods in python, we shall use the string "My name is Modupe". We will assign it to a variable x.

This is so that we can work with variable x, instead of the actual string "My name is Modupe". 

The reason for doing this is mostly for convenience since it is easier to type x, instead of the statement "My name is Modupe".

In [1]:
x = "My name is Modupe"

**String Slicing:**

This is a piece of code that helps us to divide strings up or retrieve only certain parts of a string. These portions of the string are accessed using the **indices** of the string.

The **indices** of the string are the numbers that represent the positions of the individual characters in the string.

For example, in the string x = "My name is Modupe", the first "M" takes up index 0. The "y" takes up index 1, the space after these two letters takes up index 2, the "n" takes up index 3, and so on.

To access the indices from behind, we use negative numbers. The last letter "e" is indexed by -1, the letter "p" is indexed by -2, and so on.

In [None]:
x[0]

x[1]

x[5]

x[-1]

x[-2]

So far, we have been able to select the particular letter of the string that we want. But what if we want multiple letters? What if we want whole sections of a string but still not the entire thing? This is where string slicing really shines.

In this case, we choose the index where the portion we want should start, and the index where it should stop. This gives us the following syntax for string slicing:

string[beginning_index : end_index]. The colon (:) signifies that we want the range of indices from beginning_index up to, but not including, end_index. Leaving out beginning_index starts the slice at the beginning of the string, and leaving out end_index runs it to the end of the string.

In [3]:
#Slicing

x[3:5]

x[0:]

x[:-1]

x[2: -1]

x[3:-4]


'na'
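Since a notebook cell only displays its last expression, the following sketch prints each slice separately so that every result is visible. The variable names on the left are illustrative.

```python
# Slicing sketch: the start index is included, the end index is excluded.
x = "My name is Modupe"

first_word = x[0:2]     # characters at index 0 and 1
second_word = x[3:7]    # characters at index 3, 4, 5 and 6
last_word = x[-6:]      # the last six characters
all_but_last = x[:-1]   # everything except the final character

print(first_word)    # My
print(second_word)   # name
print(last_word)     # Modupe
print(all_but_last)  # My name is Modup
```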

**Strip:**

This is the functionality of removing either spaces or other specified characters from the beginning or end of a string. For example, if we had the following string:

@username

We can use strip to remove the "@" and obtain the username itself. We can also use strip on the following string:

HHData Science ClassHH

The strip() method can be used to remove the "H" characters on either side of the string "Data Science Class".


In the example given below, spaces have been placed at the beginning and end of the string. Running the strip method on the string removes these spaces.

To remove characters other than spaces, we place the characters in quotes within the brackets of the strip method. This is, in fact, how to deal with the "HHData Science ClassHH" and "@username" examples given earlier. The syntax would be:

string.strip("HH") for HHData Science ClassHH and string.strip("@") for @username. Note that strip treats its argument as a set of characters to remove, so string.strip("H") gives the same result as string.strip("HH").

In [4]:
#Strip, lstrip, rstrip, strip with character
x = "    My name is Modupe   "
x.strip()


'My name is Modupe'

In [None]:
x = "    My name is Modupe   "

x.strip()

#x.lstrip()

#x.rstrip()

In [None]:
x = "##My name is Modupe##"

x.strip("#")

**Replace:**

This is used to replace a character or substring within a string with another string. Every occurrence is replaced. For example, given a string "Ade", we can replace the letter "e" with the letter "a" in the following way:

string.replace("e", "a")

In [None]:
#Replace
x.replace("#", "*")

**Split:**

This is used to split a string into individual portions based on a specified separator. For example,

"Ade is a boy" can be split at the space between each word by typing: string.split(" "). The result is a list:
["Ade", "is", "a", "boy"]


We can also split by any other character as well.

In [5]:
#split, rsplit

x.split(" ")

#x.split(" ", 2)

#x.rsplit(" ", 2)

['', '', '', '', 'My', 'name', 'is', 'Modupe', '', '', '']
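The output above contains empty strings because every single space counts as a separator. The sketch below, using illustrative variable names, shows how the optional maxsplit argument limits the number of splits, and how calling split with no argument splits on any run of whitespace.

```python
sentence = "Ade is a boy"

words = sentence.split(" ")         # split at every space
first_two = sentence.split(" ", 2)  # at most 2 splits, counted from the left
last_two = sentence.rsplit(" ", 2)  # at most 2 splits, counted from the right

# With no separator, split() ignores leading, trailing and repeated whitespace.
padded = "    My name is Modupe   "
no_empty = padded.split()

print(words)      # ['Ade', 'is', 'a', 'boy']
print(first_two)  # ['Ade', 'is', 'a boy']
print(last_two)   # ['Ade is', 'a', 'boy']
print(no_empty)   # ['My', 'name', 'is', 'Modupe']
```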

**Join:**

The join method basically does the opposite of the split method. It takes a list of strings and brings them together to form a single string, placing the string it is called on between the items.

In [55]:
#join

y = ["John", "is", "a", "boy"]

" ".join(y)

'John is a boy'

In [60]:
string1 = " ".join(y)

string1.split("o")

['J', 'hn is a b', 'y']

**Upper**:

The upper method simply converts all the letters in a string to uppercase.


**Lower**:
The lower method converts all the letters in a string to lowercase.


**Capitalize**:
This method converts the first letter alone to uppercase and changes all the others to lowercase.


**Islower**:
This method checks if all the letters in a string are lowercase.


**Isupper**:
This method checks if all the letters in a string are uppercase.

In [65]:
#Upper, Lower, Capitalize

x.upper()

#x.lower()

#x.capitalize()

'Iyiola'

In [73]:
#isupper, islower

x.isupper()

#x.islower()

False

**Isalpha**:
Used to check whether a string consists only of letters of the alphabet.

**Isnumeric**:
Used to check whether a string consists only of numeric characters. It does nearly the same job as **Isdigit**. The difference between them is subtle enough to ignore for now.

**Isalnum**:
Used to check whether a string consists only of letters and/or numbers. A string made of only letters, or only numbers, also passes this check. Spaces and punctuation make all three checks return False.

In [83]:
#isalpha,isnumeric,isalnum,isdigit

x = "45, 56"

#x.isalpha()

x.isnumeric()

#x.isalnum()

#x.isdigit()

False
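The cell above returns False because "45, 56" contains a comma and a space. A small sketch with a few illustrative sample strings makes the behaviour of the three checks easier to compare.

```python
samples = ["Modupe", "2024", "Modupe2024", "45, 56"]

# For each sample, record (isalpha, isnumeric, isalnum).
results = {s: (s.isalpha(), s.isnumeric(), s.isalnum()) for s in samples}

for sample, checks in results.items():
    print(sample, checks)

# Modupe (True, False, True)
# 2024 (False, True, True)
# Modupe2024 (False, False, True)
# 45, 56 (False, False, False)
```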

**Count**:

This is used to count the occurrences of a particular character or substring in a string.

In [91]:
#count
x = "My name is ModupMe Iyiola Priscillia"
x.count(" ")

#x.count("m")

5

In [112]:
for index, letter in enumerate(x):
    if letter == "M":
        print(f"The index is: {index}")
        

The index is: 0
The index is: 11
The index is: 16


**Find**:

This is used to return the index of the first occurrence of a character or substring within a string. It stops at the first match and does not look for further occurrences in the rest of the string. The related method .rfind() returns the index of the last occurrence instead. Both return -1 when there is no match.

In [93]:
#Find

x.find("M")

x.rfind("M")

16
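Only the result of rfind is displayed above, since it is the last expression in the cell. The sketch below stores each result in an illustrative variable so that find and rfind can be compared, including the case where nothing is found.

```python
x = "My name is ModupMe Iyiola Priscillia"

first_M = x.find("M")          # index of the first "M"
last_M = x.rfind("M")          # index of the last "M"
word_start = x.find("Iyiola")  # find also works with whole substrings
missing = x.find("z")          # -1 signals that there is no match

print(first_M, last_M, word_start, missing)  # 0 16 19 -1
```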

**Startswith**:

This is used to check whether a string starts with a specified character or set of characters.



**Endswith**:

This is used to check whether a string ends with a specified character or set of characters.

In [98]:
#startswith

x.startswith("M")

True

In [None]:
#endswith

x.endswith("j")

**Partition**:

This is used to break a string into 3 parts at the first occurrence of a separator. The result is a tuple of 3 items: the part of the string before the first instance of the separator, the separator itself, and the remaining part of the string.

In [100]:
#partition by a character or word

x.partition(" ")

('My', ' ', 'name is ModupMe Iyiola Priscillia')
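Because partition always returns exactly 3 items, the result can be unpacked straight into three variables. The string "sales=15000" and the variable names below are made up for illustration.

```python
record = "sales=15000"

# Unpack the 3-item tuple into separate variables.
key, separator, value = record.partition("=")

# If the separator is absent, the whole string comes first,
# followed by two empty strings.
no_separator = "sales".partition("=")

print(key, value)    # sales 15000
print(no_separator)  # ('sales', '', '')
```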

**F-strings**:

These are strings, written with an f before the opening quote, that let you insert the value of a variable or expression into the string by placing it inside curly brackets {}.

In [101]:
#F-strings

f"{x}"

y = "Tolu"

f"{x} and {y} is my friend"

'My name is ModupMe Iyiola Priscillia and Tolu is my friend'
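The curly brackets can hold more than a variable name. As a minimal sketch (the names and figures below are illustrative), an f-string can also evaluate an expression or format a number.

```python
name = "Tolu"
sales = 15000
cost = 11000

# An expression inside {} is evaluated before being inserted.
message = f"{name} made {sales - cost} in revenue"

# A format specifier after a colon controls how the value is shown.
formatted = f"{sales:,}"

print(message)    # Tolu made 4000 in revenue
print(formatted)  # 15,000
```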

**Swapcase**:

This is a method that converts uppercase letters to lowercase and, at the same time, converts lowercase letters to uppercase. In effect, it switches the case of every letter, hence the name "swapcase". For example,

"JAre" is swapped to become "jaRE"

In [110]:
#Swapcase

x[0:10].swapcase() + x[10:]

'mY NAME IS ModupMe Iyiola Priscillia'

In [107]:
50 + 35

85

**Len**:

Len is a built-in function rather than a string method, so it is written as len(x) instead of x.len(). It is used to calculate the length of a string. It counts every letter, space, and any other character within the quotation marks representing the string.


In [111]:
#len

len(x)

36

In [113]:
len("Ben is a boy")

12

**STRING CONCATENATION**:

This is the process of joining two strings together to get a larger string. For example, "40" + "20" will not give us "60"; it gives us "4020", because these are strings that merely contain digits. When the "+" sign is used between two strings, Python always performs concatenation, never arithmetic.

In [None]:
first_name = "Olujare"

last_name = "Dada"


first_name + " " + last_name
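The difference between joining strings and adding numbers can be checked directly. This is a small sketch with illustrative variable names; it also shows that mixing a string and a number with "+" raises an error.

```python
as_text = "40" + "20"               # concatenation of two strings
as_numbers = int("40") + int("20")  # convert to numbers first, then add

# A string and a number cannot be combined with "+" directly.
try:
    "40" + 20
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Convert the number to a string before concatenating.
label = "Total: " + str(as_numbers)

print(as_text)        # 4020
print(as_numbers)     # 60
print(mixing_failed)  # True
print(label)          # Total: 60
```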

**Using String manipulation and the already discussed techniques in Pandas**

So far, you might be wondering why we are learning some of these string manipulation techniques among the other things discussed earlier.

On their own, these concepts are not very useful and this is why using them to solve certain examples that come up in reality is a good way to understand their usefulness.

Pandas is an open source Python package that is most widely used for data science/data analysis and machine learning tasks. It helps us to visualize the datasets we have in tabular form. It also provides a number of data mining, manipulation, and cleaning functionalities which will be useful for us in our journey.

Some of these are discussed below, as they concern a number of topics already covered.

The next line of code is not compulsory to know just yet, but it is well-described for all who are interested.

This code bit is for initializing pandas and then telling pandas to read our dataset file.

In [17]:
#This line imports pandas and makes all its functionality available. This is how most packages for data science, and python
#in general, are loaded.

#Normally, using only "import pandas" would be good enough but we use "as pd" to create an alias for pandas within our code.
#This is so as to help us refer to pandas in our code by just typing "pd" instead of typing the full name "pandas"
import pandas as pd



#Here, we use the alias of pandas, "pd", to read our .csv file which contains the data we want to analyze. This is the standard
#method for importing .csv data into pandas. We shall look into other data formats and other slightly more complicated data
#imports in later classes.
data = pd.read_csv("working_with_tables.csv")


#data.head() simply tells pandas that we want to view the first 5 rows of the data

data.head()

Unnamed: 0,Firstname,Lastname,age,sales,cost
0,Olanrewaju,Kazeem,[24],15000,11000
1,Chibuzo,Ekenne,[30],25800,15200
2,Onyinyechi,Amos,[22],45000,28000
3,Orlando,Bloom,[25],28000,22000
4,Alex,Iwobi,[34],36500,18850


Pandas has displayed the table for us. The table contains 5 rows: row 0, row 1, row 2,..., row 4. The table contains 5 columns:
"Firstname", "Lastname", "age", "sales", "cost". **It is important to note that rows are the horizontal fields while columns are the vertical fields in the dataset, always**

From the above table, we can see straightaway that the ages are in square brackets, which is not ideal for analysis. We can also see that the first and last names of each person are in separate columns. Sometimes, it might make sense to join them together. Lastly, we might want to calculate the revenue generated by each sales representative by subtracting the cost from the amount of sales made.


For these 3 things outlined above, we already have enough information to solve them with relative ease. We shall go through each problem sequentially with simple steps to follow.

But first, we must discuss how to access each column.

In [19]:
data["Firstname"] #This displays all the values in the "Firstname" column

data["Lastname"] #This displays all the values in the "Lastname" column

data["age"] #This displays all the values in the "age" column

Unnamed: 0,Firstname,Lastname,age,sales,cost
0,Olanrewaju,Kazeem,[24],15000,11000
1,Chibuzo,Ekenne,[30],25800,15200
2,Onyinyechi,Amos,[22],45000,28000
3,Orlando,Bloom,[25],28000,22000
4,Alex,Iwobi,[34],36500,18850
5,Kachi,Felix,[50],44235,33218
6,Olujare,Dada,[28],27500,25800
7,Tomiwa,Sogaolu,[42],50509,34850
8,Timilehin,Kupolokun,[49],88390,76001
9,Nnena,Dickson,[39],122200,85900


While there are more ways to access columns and even rows from the dataset, the above examples suffice for now.

In [36]:
data["Firstname"].str[3:5]

0    nr
1    bu
2    in
3    an
4     x
5    hi
6    ja
7    iw
8    il
9    na
Name: Firstname, dtype: object

### Sorting out the Names


Remember string concatenation? Well, that is what we shall use to join the first and last names together. The only difference is that instead of creating our own variables like x = "Alex" and y = "Iwobi" and then computing x + " " + y to get "Alex Iwobi", we use the table columns as our variables.

Using the table columns applies the operation to every row of the table at once. An example of this is displayed below:

In [20]:

data["Firstname"] +" "+ data["Lastname"]

0      Olanrewaju Kazeem
1         Chibuzo Ekenne
2        Onyinyechi Amos
3          Orlando Bloom
4             Alex Iwobi
5            Kachi Felix
6           Olujare Dada
7         Tomiwa Sogaolu
8    Timilehin Kupolokun
9          Nnena Dickson
dtype: object

This has produced a similar result to our string concatenation discussed earlier. However, it is important to note that:
**Pandas has not made this change permanent yet. We can easily spot when pandas has not made a data manipulation step permanent because it simply displays the result of the manipulation instead of storing it. If we want the change to be permanent, we should assign the result of the data manipulation step to a new column name**

This step is carried out below. But first, we will check whether the string concatenation we performed had any impact on our table.

In [31]:
data.head()

Unnamed: 0,Firstname,Lastname,age,sales,cost
0,Olanrewaju,Kazeem,[24],15000,11000
1,Chibuzo,Ekenne,[30],25800,15200
2,Onyinyechi,Amos,[22],45000,28000
3,Orlando,Bloom,[25],28000,22000
4,Alex,Iwobi,[34],36500,18850


Clearly, the table has not changed: there is no new column containing the full names. As a result, we need to assign the result of the data manipulation to a new column name. This is done below:

In [24]:
data["Fullname"] = data["Firstname"] +" "+ data["Lastname"]

In [22]:
data.head()

Unnamed: 0,Firstname,Lastname,age,sales,cost,Fullname
0,Olanrewaju,Kazeem,[24],15000,11000,Olanrewaju Kazeem
1,Chibuzo,Ekenne,[30],25800,15200,Chibuzo Ekenne
2,Onyinyechi,Amos,[22],45000,28000,Onyinyechi Amos
3,Orlando,Bloom,[25],28000,22000,Orlando Bloom
4,Alex,Iwobi,[34],36500,18850,Alex Iwobi


### Sorting out Ages


We can see that the ages of the sales representatives are in square brackets. We can use the .strip method discussed in string methods above. We just need to signify what character we are stripping away from the values in ages column. In this case, we want to get rid of the square brackets. This is done below:

In [27]:
data["age"].str.strip("[")

0    24]
1    30]
2    22]
3    25]
4    34]
5    50]
6    28]
7    42]
8    49]
9    39]
Name: age, dtype: object

In [28]:
data["age"].str.strip("]")

0    [24
1    [30
2    [22
3    [25
4    [34
5    [50
6    [28
7    [42
8    [49
9    [39
Name: age, dtype: object

Notice that ".str" is used before we apply the standard python string method, strip. This ".str" is called an accessor. It gives us access to each value in the column we are interested in as an individual string that python can work on. Put differently, because we use ".str", we are able to apply standard python string methods to each string within a column. Without it, python string methods cannot be applied to a pandas column directly. So ".str" helps us to reach the strings in the table, which then lets us use the string methods discussed above.
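Conceptually, ".str.strip" applies the ordinary strip method to every value in the column. The sketch below imitates this in plain Python, with a short list standing in for the "age" column (an assumption made purely for illustration; it is not how pandas is implemented). It also shows that both brackets can be removed in a single call, because strip accepts several characters at once. In pandas, the corresponding call would be data["age"].str.strip("[]").

```python
# A plain list standing in for the first few values of data["age"].
ages = ["[24]", "[30]", "[22]"]

# Apply strip to each value, removing "[" and "]" in one call.
cleaned = [value.strip("[]") for value in ages]

# Once the brackets are gone, the values can be converted to numbers.
as_numbers = [int(value) for value in cleaned]

print(cleaned)     # ['24', '30', '22']
print(as_numbers)  # [24, 30, 22]
```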

Now, all that is left to do is to assign the result to the table column name to make the changes permanent.

In [30]:
data["age"] = data["age"].str.strip("[")

data["age"] = data["age"].str.strip("]")

data.head()

Unnamed: 0,Firstname,Lastname,age,sales,cost,Fullname
0,Olanrewaju,Kazeem,24,15000,11000,Olanrewaju Kazeem
1,Chibuzo,Ekenne,30,25800,15200,Chibuzo Ekenne
2,Onyinyechi,Amos,22,45000,28000,Onyinyechi Amos
3,Orlando,Bloom,25,28000,22000,Orlando Bloom
4,Alex,Iwobi,34,36500,18850,Alex Iwobi


***NOTE: IT IS ONLY WHEN YOU NEED TO USE STRING METHODS THAT YOU NEED "str". OTHER CALCULATIONS JUST REQUIRE WORKING WITH THE COLUMN NAMES ALONE***

### Creating the Revenue column


Here, we again use one of the concepts we have learned earlier. This time, we are subtracting real numbers, not strings.

Therefore, just as we did with the string columns, we do not need to create new variables for this operation, we only need to use the name of the columns in the table.

In [32]:
data["sales"] - data["cost"]

0     4000
1    10600
2    17000
3     6000
4    17650
5    11017
6     1700
7    15659
8    12389
9    36300
dtype: int64

In [33]:
data["Revenue"] = data["sales"] - data["cost"]

data.head()

Unnamed: 0,Firstname,Lastname,age,sales,cost,Fullname,Revenue
0,Olanrewaju,Kazeem,24,15000,11000,Olanrewaju Kazeem,4000
1,Chibuzo,Ekenne,30,25800,15200,Chibuzo Ekenne,10600
2,Onyinyechi,Amos,22,45000,28000,Onyinyechi Amos,17000
3,Orlando,Bloom,25,28000,22000,Orlando Bloom,6000
4,Alex,Iwobi,34,36500,18850,Alex Iwobi,17650


We have subtracted the Cost column from the Sales column and it has given us the Revenue column. However, we can also perform other simple arithmetic tasks such as addition, multiplication and division.

### Homework

Use the working_with_tables_homework.csv file to answer the following questions.

- Write a program to calculate the length of the string

- Write a program to get a string made of the first 2 and last 2 characters
from a given string.

- Write a program to get a string made of the first 2, a middle letter,
and last 2 characters from a given string.

- Write a python program to add "ing" at the end of a given string.

- Write a python program to remove the nth index character from a string.


- Create a new column that generates the initials of the sales representatives. The initials must be in the form:

> The first letter of their first name + . + The first letter of their last name. For example:
> Olanrewaju Kazeem would be: O.K


- If the commission that each sales representative is entitled to is 15% of the revenue generated, create a new column that contains this commission.


- Suppose we are interested in knowing the character length of each person's full name for the new app the company is creating. Create a new column containing this character length for each sales representative.


- In the new app to be created, the first 4 characters of the employee's first name are to be used as the username, along with the first 4 characters of the employee's last name. These two groups of 4 characters must be separated by an underscore. For example,

> Olanrewaju Kazeem will have a username of: Olan_Kaze
> - Create a new column that contains this new username.

### Built-in Data Structures


These are simply objects in python that help us store and manipulate data.

So far, we have discussed how numbers and strings themselves can be manipulated and stored in variables. This level of manipulation and storage is not sufficient for Data Science. We are going to be working with millions of rows and columns of data and simply working with variable assignment and string manipulation won't cut it. This is one of the main reasons we need to learn Data Structures.


For now, we will focus on the Built-in Data Structures within the Python programming language. These structures have parallels in other programming languages as well, so they are useful to anyone using a programming language for any purpose, from website creation to hacking.


The 4 major Data Structures are:

- Lists

- Tuples

- Sets

- Dictionaries

#### Lists:


This is simply a collection of items. Items could be numbers, strings, or any other type of objects we have yet to discuss in python.

It has certain properties that set it apart from other Data Structures, but these properties will be discussed later. For now, we introduce these data structures.

**A list with numbers:**

In [None]:
A = [2, 4, 6, 8]

A

In [117]:
type(None)

NoneType

**A list with strings:**

In [None]:
B = ["Modupe", "Iyiola", "Priscillia"]

B

A list can contain a mixture of numbers and strings and even other lists. It can contain a Tuple, a Set or Dictionaries.

In [None]:
y = ["kunle", 5, 10.5, ["another list", "is", "in", "this", "list y"]]
y

**Tuple**:

This can also hold any type of object, just like a list. And just like lists, it has its own set of unique properties, discussed later.

In [None]:
C = (2, 4, 7, 8)

C

**Sets**:


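This is another collection of items, only this time, duplicate items are not allowed: any duplicates are removed automatically. The items in a set have no guaranteed order, even though small sets of numbers are often displayed in ascending order.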

In [118]:
D = {2, 1, "two", 2, 3, 3, 0}

D

{0, 1, 2, 3, 'two'}

**Dictionaries**:

A dictionary contains key-value pairs. A value can easily be accessed via its key. For the example given below, the keys are: "one", "two", and "three", while the values are 1, 2 and 3 respectively.

In [119]:
E = {"one": 1, "two" : 2, "three": 3}

E

{'one': 1, 'two': 2, 'three': 3}

**Properties of Data Structures**:


**List**:
- Lists are ordered.
- Lists can contain any arbitrary objects.
- List elements can be accessed by index.
- Lists can be nested to arbitrary depth.
- Lists are mutable.
- Lists are dynamic.


**Tuple**:

- Ordered, 
- Unchangeable
- Allow duplicate values.


**Set**:
- Sets are unordered, as a result, they cannot be indexed.
- Set elements are unique. Duplicate elements are not allowed.
- A set itself may be modified, but the elements contained in the set must be of an immutable type.


**Dictionary**:
- Dictionaries are accessed by key, not by position. (Since Python 3.7, they do remember the order in which keys were inserted.)
- Keys are unique.
- Keys must be immutable. 
- Values of dictionaries can be accessed by the keys


Below, we shall proceed to explain these properties.

**List**

In [7]:
# Lists are ordered

L = ["a", "c", "b"]

M = ["a", "b", "c"]

N = ["a", "b", "c"]

L == M  # False: same elements, but in a different order

M == N  # True: same elements in the same order

True

In [124]:
# Lists can contain any arbitrary objects.

L = ["one", 2, "three", 4, None]

L

['one', 2, 'three', 4, None]

In [125]:
# Lists are accessed by index

L[0]

L[1]

L[2]

L[3]

4

In [48]:
# Lists can be nested

M = ["Father", "Mother", ["Brother", "Sister"]]

M[0]

M[1]

M[2]

M[2][0]

M[2][1]

'Sister'

In [49]:
# Lists are mutable (changeable)

N = ["Apple", "Samsung", "Nokia"]

N += M

N

['Apple', 'Samsung', 'Nokia', 'Father', 'Mother', ['Brother', 'Sister']]

In [None]:
N[0] = 1

N

**Tuple**

In [12]:
# Tuples are ordered

T = (1, "a", 2, "b")

U = (1, 2, "a", "b")

T == U

False

In [38]:
# Tuples are unchangeable

T[1] = 19

TypeError: 'tuple' object does not support item assignment
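Since a tuple cannot be changed in place, the usual way around this is to build a new tuple. The sketch below catches the error shown above and then creates an updated copy; the names `message` and `updated` are illustrative.

```python
T = (1, "a", 2, "b")

# Item assignment fails, so we catch the error instead of stopping.
try:
    T[1] = 19
except TypeError as error:
    message = str(error)

# Build a new tuple from slices of the old one plus the new value.
updated = T[:1] + (19,) + T[2:]

print(message)  # 'tuple' object does not support item assignment
print(T)        # (1, 'a', 2, 'b')  (the original is untouched)
print(updated)  # (1, 19, 2, 'b')
```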

In [43]:
# Allow duplicates
R = (1, 1, 4, 2, 4, 4, 2)

R

(1, 1, 4, 2, 4, 4, 2)

In [44]:
R[2] #Tuples can be indexed

4

**Sets**

In [35]:
D = {2, 1, "two", 2, 3, 3, 0}

D #Duplicates have been removed

{0, 1, 2, 3, 'two'}

In [42]:
D[2] #sets cannot be indexed

TypeError: 'set' object is not subscriptable

In [41]:
D.add(13)  #Sets can be updated by adding or removing elements, but an existing element cannot be changed in place.

D

{0, 1, 13, 2, 3, 'two'}

In [45]:
K = {1, 2, 0, "two", 3, 2, 3}
D = {2, 1, "two", 2, 3, 3, 0}

#Sets are unordered. This is why the exact same set representations are equal
K == D

True

**Dictionary**:

In [6]:
E = {"one": 1, "two" : 2, "three": 3}

#Keys are unique
E["one"] = 2 # Dictionaries can be indexed using their keys

E

{'one': 2, 'two': 2, 'three': 3}

Notice that when we try to create another element in the dictionary that has the same key but a different value, the value attached to the key "one" just got updated instead of creating duplicate keys with different values.
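The same square-bracket syntax is used to read a value, update a value, and add a new key. A short sketch follows; the key "four" and the use of .get() with a default are additions for illustration.

```python
E = {"one": 1, "two": 2, "three": 3}

E["one"] = 2   # the key already exists, so its value is updated
E["four"] = 4  # the key is new, so a new key-value pair is added

value = E["two"]           # read a value through its key
missing = E.get("five", 0) # .get() returns a default when the key is absent
keys = list(E)             # keys come out in insertion order (Python 3.7+)

print(E)        # {'one': 2, 'two': 2, 'three': 3, 'four': 4}
print(value)    # 2
print(missing)  # 0
print(keys)     # ['one', 'two', 'three', 'four']
```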

**Operations with lists and list methods**

- append: This is used to add a new element to the end of a list.
- clear: This is used to remove all the elements within a list.
- copy: This is used to make a copy of a list.
- count: This is used to count the number of instances of an element within a list.
- extend: This adds all the elements of an iterable (list, tuple, string etc.) to the end of the list
- index: This is used to return the index of a specified element within a list.
- insert: This is used to insert the specified value at the specified position. Syntax. list.insert(pos, elmnt).
- pop: This is used to remove and return the last element of the list (or the element at a given index).
- remove: This is used to remove the first occurrence of a specific element from the list.
- reverse: This is used to reverse the order of a list.
- sort: This is used to sort a list in ascending or descending order.

In [50]:
N

['Apple', 'Samsung', 'Nokia', 'Father', 'Mother', ['Brother', 'Sister']]

In [66]:
N.pop()
N

['Apple',
 'Samsung',
 'Nokia',
 'Father',
 'Mother',
 ['Brother', 'Sister'],
 'Father']

In [67]:
N.append("Last_element")
N

['Apple',
 'Samsung',
 'Nokia',
 'Father',
 'Mother',
 ['Brother', 'Sister'],
 'Father',
 'Last_element']

In [68]:
new_N = N.copy()

new_N

['Apple',
 'Samsung',
 'Nokia',
 'Father',
 'Mother',
 ['Brother', 'Sister'],
 'Father',
 'Last_element']

In [70]:
N.count("Father")

2

In [73]:
N.extend(L)
N

['Apple',
 'Samsung',
 'Nokia',
 'Father',
 'Mother',
 ['Brother', 'Sister'],
 'Father',
 'Last_element',
 'a',
 'c',
 'b']

In [78]:
N.index("Mother")

4

In [79]:
N.insert(10, "Jare")
N

['Apple',
 'Samsung',
 'Nokia',
 'Father',
 'Mother',
 ['Brother', 'Sister'],
 'Father',
 'Last_element',
 'a',
 'c',
 'Jare',
 'b']

In [81]:
N.remove("Nokia")

N

['Apple',
 'Samsung',
 'Mother',
 ['Brother', 'Sister'],
 'Father',
 'Last_element',
 'a',
 'c',
 'Jare',
 'b']

In [82]:
N.reverse()

N

['b',
 'Jare',
 'c',
 'a',
 'Last_element',
 'Father',
 ['Brother', 'Sister'],
 'Mother',
 'Samsung',
 'Apple']

In [88]:
L.sort()

L

['a', 'b', 'c']

In [89]:
L.sort(reverse = True)

L

['c', 'b', 'a']
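Because notebook cells may be run out of order, the outputs above depend on what the list N held at the time. The following self-contained sketch starts from a fresh list and applies several of the methods in sequence. The list `phones` and the names "Tecno" and "Infinix" are illustrative.

```python
phones = ["Apple", "Samsung", "Nokia"]

phones.append("Tecno")               # add to the end
removed = phones.pop()               # remove and return the last element
first = phones.pop(0)                # remove and return the element at index 0
phones.insert(0, first)              # put it back at index 0
phones.extend(["Nokia", "Infinix"])  # add every element of another list

nokia_count = phones.count("Nokia")  # counted before removing one
phones.remove("Nokia")               # removes only the first "Nokia"
position = phones.index("Nokia")     # index of the remaining "Nokia"

print(removed, first)  # Tecno Apple
print(nokia_count)     # 2
print(position)        # 2
print(phones)          # ['Apple', 'Samsung', 'Nokia', 'Infinix']
```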

### Operations and Methods with Sets


- add: This is used to add a single value to the elements of a set.
- clear: This is used to remove all the elements of a set.
- copy: This is used to make an extra copy of a set.
- difference: Returns a set containing the elements of this set that are not in one or more other sets.
- difference_update: Removes the items in this set that are also included in another, specified set.
- discard: This is used to remove an element from a set. It works like remove(), except that it does not raise an error if the element is not in the set.
- intersection: This is used to find the common elements between two sets.
- intersection_update: Removes the items in this set that are not present in the other, specified set(s).
- isdisjoint: This is used to check if two sets have no elements in common.
- issubset: This is used to check if a set is the subset of another set. If set A is a subset of set B, then all of set A's elements can be found in set B.
- issuperset: This is used to check if a set is the superset of another set. If set A is a superset of set B, then set A contains all the elements of set B, and may contain extra elements not found in set B.
- pop: Removes and returns an arbitrary element from a set. Since sets are unordered, we cannot choose which element is removed.
- remove: Removes a specified element from a set, and raises an error if the element is not present.
- symmetric_difference: Returns the set of elements that are in set A or in set B, but not in both.
- symmetric_difference_update: Updates the set in place so that it holds the symmetric difference of itself and another set, instead of returning a new set.
- union: This is for creating a combination of all the unique elements of two or more sets.
- update: This is used to add all the elements of one or more iterables (list, tuple, string, dict, etc.) to a set.
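A few of these methods are sketched below on two small sets of numbers. The set names `group_a` and `group_b` are illustrative. Results are compared with == rather than printed, because the display order of a set is not guaranteed.

```python
group_a = {1, 2, 3, 4}
group_b = {3, 4, 5}

combined = group_a.union(group_b)               # every unique element
common = group_a.intersection(group_b)          # elements in both sets
only_a = group_a.difference(group_b)            # in group_a but not in group_b
either = group_a.symmetric_difference(group_b)  # in exactly one of the sets

# discard() is silent when the element is absent; remove() raises KeyError.
group_a.discard(99)
try:
    group_a.remove(99)
    remove_failed = False
except KeyError:
    remove_failed = True

is_subset = {3, 4}.issubset(group_a)
is_disjoint = group_a.isdisjoint({7, 8})

print(combined == {1, 2, 3, 4, 5})  # True
print(common == {3, 4})             # True
print(only_a == {1, 2})             # True
print(either == {1, 2, 5})          # True
print(remove_failed, is_subset, is_disjoint)  # True True True
```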

In [2]:
D

{0, 1, 2, 3, 'two'}

### Homework


- Use the set methods listed above to perform operations involving sets.

- Find and use Tuple methods to perform operations involving tuples.

- Find and use Dictionary methods to perform operations involving dictionaries.

- In the working_with_tables_homework.csv file, find a way to separate the email service provider from each sales representative's email into a separate column.

- There are 8 students in a room: Alicia, Ope, Fiyin, Emeka, Kenny, Timothy, Grace, and Kunle. Use this information to answer the following questions:
> 1. If one more person named Emeka joins these 8 students, which data structure can we use to represent this information?
> 2. If there are two groups, A and B. Ope, Fiyin, Timothy, and Kunle are in group A while, Alicia, Emeka, Kenny, Timothy, and Ope are in group B. Which data structure can help us find the students common to both groups?
> 3. If we want these 8 students to vote for who would be class president between Sandra and Michael, and we want to be able to know the order in which the students voted (e.g Emeka voted first, Alicia voted second, and so on), what data structure should we use?
> 4. If the scenario in question 3 is repeated, but this time, we do not care about the order in which the students voted, what data structure should we use?
> 5. If we wanted to record the number of votes that Michael and Sandra got from the election, which data structure is best suited for the task?

### Converting between Data Structures

**Converting Lists to Tuples or Sets**

In [11]:
L
tuple(L)
set(L)

{'a', 'b', 'c'}

**Converting Tuples to Lists or Sets**

In [15]:
T
list(T)
set(T)

{1, 2, 'a', 'b'}

**Converting Sets to Tuples or Lists**

In [19]:
D
list(D)
tuple(D)

(0, 1, 2, 3, 'two')

**Converting Dictionaries to Tuples, Lists, or Sets:**

Since a dictionary contains two pieces of linked information: keys and values, it follows that we should be able to create two children data structures from the keys and values of a dictionary.

In [25]:
list(E) #This takes just the keys and converts them to a list. Another way to do this is: list(E.keys())
tuple(E) #This takes just the keys and converts them to a tuple. Another way to do this is: tuple(E.keys())
set(E) #This takes just the keys and converts them to a set. Another way to do this is: set(E.keys())


list(E.items()) #This creates a list of tuples containing the key-value pairing.
tuple(E.items()) #This creates a tuple of tuples containing the key-value pairing.
set(E.items()) #This creates a set of tuples containing the key-value pairing.


list(E.values()) #This takes just the values and converts them to a list.
tuple(E.values()) #This takes just the values and converts them to a tuple.
set(E.values()) #This takes just the values and converts them to a set.

{('one', 2), ('three', 3), ('two', 2)}
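Only the last conversion is displayed above. The sketch below stores a few of the conversions in illustrative variables so they can be inspected individually, and also shows the reverse direction: a list of key-value tuples can be turned back into a dictionary with dict().

```python
E = {"one": 2, "two": 2, "three": 3}

keys = list(E)                  # same as list(E.keys())
values = list(E.values())       # values only, duplicates kept
unique_values = set(E.values()) # converting to a set drops duplicates
pairs = list(E.items())         # list of (key, value) tuples

# dict() rebuilds a dictionary from the key-value pairs.
rebuilt = dict(pairs)

print(keys)           # ['one', 'two', 'three']
print(values)         # [2, 2, 3]
print(pairs)          # [('one', 2), ('two', 2), ('three', 3)]
print(rebuilt == E)   # True
```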

### Using Built-in Data structures to create Pandas Objects


The Built-in Data Structures discussed so far are very useful but as Data Scientists, we would most likely deal more with tables than with lists, sets, dictionaries or tuples. Since data in table form is a more appropriate way to represent data, the primitive data structures learned so far would not suffice in the long run.

It is therefore important to link what we know with what we will use in the near future: pandas table representations. Pandas has historically offered 3 ways to represent data. They are:

- **Pandas Series**: A Series is a one-dimensional, array-like structure with homogeneous data. The values of a Series are mutable, meaning we can change any value in it, but its size is immutable, so we cannot change the number of values it holds.

- **Pandas DataFrames**: A DataFrame is a two-dimensional, table-like structure with heterogeneous data. Both its size and its data are mutable, so we can change the data and the size of a DataFrame at any time.

- **Pandas Panel**: A Panel is a three-dimensional data structure with heterogeneous data. It is very hard to draw, but it can be pictured as a container of DataFrames. In a Panel, both data and size are mutable. ***(Note: Panel was deprecated in pandas 0.20 and removed in pandas 0.25. In current versions, this kind of data is represented with a DataFrame that has a MultiIndex.)***

Below is a picture depicting the above information:

![convert notebook to web app](https://miro.medium.com/max/720/1*1iHWBaNA9d_ArysIiuit8A.webp)

From the representations above, it is easy to see that Lists and Tuples can be used to create Pandas Series objects. This is because a Series is a single-column data structure, just like a List or a Tuple.
***(Note: Sets are not used because they are unordered. This means that they do not have indices that pandas can use to create its own index.)***
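The note about sets can be checked with a small sketch. The values are hypothetical, and the `TypeError` is an assumption about recent pandas versions, so the code is written to run either way. Sorting the set into a list first gives pandas a fixed order to index.

```python
import pandas as pd

# Hypothetical values for illustration.
unique_ages = {12, 8, 10}

# Assumption: recent pandas versions reject a set with a TypeError
# because a set has no order. The try/except keeps the cell runnable
# even if a given version behaves differently.
try:
    pd.Series(unique_ages)
    set_was_accepted = True
except TypeError:
    set_was_accepted = False
print(set_was_accepted)

# Converting to a sorted list first fixes the order of the values.
ages_series = pd.Series(sorted(unique_ages), name="Ages")
print(ages_series.tolist())  # [8, 10, 12]
```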

Of the Built-in data structures, only Dictionaries link two pieces of data in a key-value relationship. This implies that a dictionary whose keys are column names and whose values are lists can represent a data structure with multiple columns.

At this point, we can proceed to show how to create these pandas objects from the Built-in Data Structures.

(***Note: This is important because we would be using Pandas objects heavily in the near future***).

#### Initializing Pandas

In [26]:
import pandas as pd 

#Remember pd is just an alias. You can name your alias anything. But pd is the convention so that other
#programmers can read your code with ease

#### Pandas Series from Lists

In [32]:
ls1 = [2, 4, 6, "eight", 10]
#Notice that for the list above, the series created would have the dtype "object". Pandas uses "object" for strings and
#for data that mixes different types. This happens here because of the value "eight". If the value were 8 instead of
#"eight", pandas would recognize the data as integers (int64 on most platforms).

ls1 = [2, 4, 6, 8, 10]

pd.Series(ls1)

0     2
1     4
2     6
3     8
4    10
dtype: int64
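To see the difference described in the comments above, here is a minimal sketch that builds both versions of the list and compares their dtypes. The `pd.to_numeric` step is an addition for illustration: it is one way to recover numbers from mixed data, turning anything it cannot parse into NaN.

```python
import pandas as pd

mixed = pd.Series([2, 4, 6, "eight", 10])
numbers = pd.Series([2, 4, 6, 8, 10])

print(mixed.dtype)    # object
print(numbers.dtype)  # an integer dtype (int64 on most platforms)

# Values that cannot be read as numbers become NaN.
cleaned = pd.to_numeric(mixed, errors="coerce")
print(cleaned.isna().sum())  # 1
```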

In [58]:
#To add a column name, we use the "name" parameter.
#We assign the series created to a variable so that we can reference the series anytime we want.

ls1_series = pd.Series(ls1, name = "Values_of_ls1")
ls1_series

0     2
1     4
2     6
3     8
4    10
Name: Values_of_ls1, dtype: int64

In [56]:
ls1_series.name

'Values_of_ls1'

#### Pandas Series from Tuple

In [57]:
#Creating the tuple
tup1 = (1, 2, 3, 4, 5, 6, 7)

#Creating the pandas series
pd.Series(tup1)

#Creating the pandas series with a name and assigning it to a variable so it can be referenced any time
tup1_series = pd.Series(tup1, name = "Values_of_tup1")
tup1_series

0    1
1    2
2    3
3    4
4    5
5    6
6    7
Name: Values_of_tup1, dtype: int64

In [53]:
tup1_series.name

'Values_of_tup1'

#### Creating new Index for Pandas Series


The index is the column of labels at the far left of a Series or DataFrame. It is pandas' way of identifying the individual rows of the dataset.

The default numbering of the index is: 0, 1, 2, 3,... and so on. However, depending on our needs, we can create a new index for either our pandas series or dataframe.

Since an index is itself an ordered sequence of labels, we can create a new index from a list or a tuple.

In [62]:
tup1_index = ["A", "B", "C"]
pd.Series(tup1, name = "Values_of_tup1", index = tup1_index)


The above code raises a ValueError because there are 7 values in the pandas series but only 3 index values. This tells us that the number of index values must be exactly the same as the number of values in our pandas series. The same goes for pandas dataframes.
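As a sketch of this rule, the cell below catches the error instead of letting it stop the notebook, then builds an index with exactly one label per value. Generating the labels from `string.ascii_uppercase` is just one convenient option and is not part of the original example.

```python
import string
import pandas as pd

values = (1, 2, 3, 4, 5, 6, 7)

# Three labels for seven values: pandas refuses to build the series.
error_message = None
try:
    pd.Series(values, index=["A", "B", "C"])
except ValueError as err:
    error_message = str(err)
print(error_message)

# Take as many capital letters as there are values, so the lengths match.
labels = list(string.ascii_uppercase[:len(values)])
labelled = pd.Series(values, index=labels)
print(labelled["G"])  # 7
```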

In [63]:
tup1_index = ["A", "B", "C", "D", "E", "F", "G"]
pd.Series(tup1, name = "Values_of_tup1", index = tup1_index)

A    1
B    2
C    3
D    4
E    5
F    6
G    7
Name: Values_of_tup1, dtype: int64

#### Pandas DataFrames from Dictionary

In [65]:
#Creating the dictionary

details_dict = {
    
    "Name": ["Rahmon", "Rahila", "Ramsey", "Ranti", "Richard", "Rex"],
    "Age": [10, 12, 13, 8, 9, 11],
    "Favorite Team": ["Chelsea", "Tottenham", "Barcelona", "Arsenal", "Liverpool", "Juventus"]
    
}

#Using the dictionary to create the dataframe

pd.DataFrame(details_dict)

Unnamed: 0,Name,Age,Favorite Team
0,Rahmon,10,Chelsea
1,Rahila,12,Tottenham
2,Ramsey,13,Barcelona
3,Ranti,8,Arsenal
4,Richard,9,Liverpool
5,Rex,11,Juventus
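A custom index works for DataFrames in the same way as for Series. The sketch below uses a shortened version of the dictionary above with hypothetical index labels, and shows that selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

details = {
    "Name": ["Rahmon", "Rahila", "Ramsey"],
    "Age": [10, 12, 13],
}

# The index labels are hypothetical; there must be one label per row.
df = pd.DataFrame(details, index=["s1", "s2", "s3"])

print(df.shape)                   # (3, 2)
print(list(df.columns))           # ['Name', 'Age']
print(type(df["Age"]).__name__)   # Series
print(df.loc["s2", "Name"])       # Rahila
```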


#### Pandas DataFrames by combining two series


Besides using a dictionary, we can also create a pandas DataFrame by combining two or more existing pandas series objects.

In [71]:
pd.concat([tup1_series, ls1_series], axis = 1)

Unnamed: 0,Values_of_tup1,Values_of_ls1
0,1,2.0
1,2,4.0
2,3,6.0
3,4,8.0
4,5,10.0
5,6,NaN
6,7,NaN


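Observe that there are values in the data labelled NaN (Not a Number). These values represent missing data. They are pandas' way of telling us that there is no value for that particular column or row. Here, Values_of_ls1 has only 5 values, so rows 5 and 6 have nothing to show in that column.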
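The missing values can also be counted rather than spotted by eye. This sketch rebuilds the two series so that it runs on its own, and it shows a side effect visible in the table above: the shorter column is stored as floats, because NaN is a floating-point value.

```python
import pandas as pd

tup1_series = pd.Series((1, 2, 3, 4, 5, 6, 7), name="Values_of_tup1")
ls1_series = pd.Series([2, 4, 6, 8, 10], name="Values_of_ls1")

combined = pd.concat([tup1_series, ls1_series], axis=1)

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = combined.isna().sum()
print(missing_per_column)

# The column with NaN values is stored as floats, hence 2.0, 4.0, ...
print(combined.dtypes)
```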

**An Example of Panel Data**

![convert notebook to web app](./Panel_Data_Example.PNG)



From the above table, we can see that the data has different facets to it. Instead of just looking in two dimensions, in terms of rows and column names, panel data adds extra information on top of the two dimensional depictions we have been seeing so far.

In panel data, the column names themselves have titles that specify what information the columns depict. For example, the country title tells us that the column names at that level are names of countries, and it is the outermost level of the columns. Below it, we have series and pay period.
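In current pandas versions, layered column titles like these are usually built with a MultiIndex. The sketch below is only an illustration: the countries, series names, and figures are hypothetical and are not taken from the table in the picture.

```python
import pandas as pd

# Hypothetical figures; only the layout matters here.
columns = pd.MultiIndex.from_product(
    [["Ghana", "Nigeria"], ["Hours", "Wage"]],
    names=["Country", "Series"],
)
panel_like = pd.DataFrame(
    [[38, 90, 40, 100], [39, 95, 42, 110]],
    index=[2020, 2021],
    columns=columns,
)

# Selecting an outer title returns all the columns under it.
nigeria = panel_like["Nigeria"]
print(list(nigeria.columns))    # ['Hours', 'Wage']

# xs() takes a cross-section at an inner level, across every country.
wage_only = panel_like.xs("Wage", axis=1, level="Series")
print(list(wage_only.columns))  # ['Ghana', 'Nigeria']
```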

It is important to note that while this layout conveys a lot more information to a human, this type of data is not conducive to Machine Learning, an essential part of data science. This does not mean that analysis cannot be done on it.

If you are curious about analysis done on panel data, see the following:
https://python.quantecon.org/pandas_panel.html

(Please note that some of the code found in the above link has not been covered yet)

### Conditionals, Loops, and Functions

### Numpy

So far, we have worked with the primitive array types in Python: Lists and Tuples. These array types, however, have very limited mathematical capabilities and as a result cannot perform certain tasks required for machine learning.

In steps Numpy arrays. Numpy arrays are used to perform complex mathematical analysis on various arrays. But why do we need to perform complex calculations on arrays?

Numpy can be used to compute general descriptive statistics such as the mean, median, and standard deviation of column vectors. An understanding of Numpy is also key when using advanced machine learning packages such as Keras and Tensorflow to design Artificial Neural Networks. In image and signal processing, Numpy is of inestimable value. Most colored images can be represented as 3-D arrays (height, width, and color channel). Lists could be used to build these arrays, but performing any sort of mathematics on nested lists is either overly convoluted or not possible. Numpy solves this. Signals such as radio signals or general speech can also be processed easily, because a Numpy array can have as many dimensions as are needed to represent each component of the signal.
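A short sketch makes the contrast with lists concrete. The numbers and the tiny image size are hypothetical, chosen only to keep the output readable.

```python
import numpy as np

values = [1, 2, 3, 4]

# Multiplying a list repeats it; it does not do arithmetic.
doubled_list = values * 2
print(doubled_list)    # [1, 2, 3, 4, 1, 2, 3, 4]

# Multiplying a Numpy array works element by element.
arr = np.array(values)
doubled_arr = arr * 2
print(doubled_arr)     # [2 4 6 8]

# Descriptive statistics are single function calls.
print(arr.mean())      # 2.5
print(np.median(arr))  # 2.5

# A colored image as a 3-D array: 4 pixels high, 6 wide, 3 color channels.
image = np.zeros((4, 6, 3), dtype=np.uint8)
print(image.shape)     # (4, 6, 3)
```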

As useful as Numpy arrays are in the field of Machine Learning, great emphasis will not be placed on Numpy arrays in this course. Perhaps more advanced courses in Deep Learning and Artificial Intelligence development with Tensorflow or Keras would make use of numpy arrays in more detail.

Irrespective of this, the Numpy library is still very useful, given some of the methods it provides. As a result, 3 reading materials have been provided in the links below for further reading on Numpy. In case any of the links says you need to subscribe in order to gain access, simply subscribe and you gain free access to the article.

1. https://towardsdatascience.com/21-numpy-functions-that-will-boost-your-data-analysis-process-1671fb35215

2. https://medium.com/swlh/5-powerful-numpy-functions-for-beginners-20f4cdb49de9

3. https://www.machinelearningplus.com/python/101-numpy-exercises-python/


### Pandas

Pandas is an open source Python package that is most widely used for data science/data analysis and machine learning tasks. It is built on top of Numpy, which provides support for multi-dimensional arrays.


**What can Pandas do?**
- Data cleansing
- Data fill
- Data normalization
- Merges and joins
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data


From the above list, we can see that Pandas is able to perform many useful tasks (which might not be apparent right now). The tasks listed form a core part of the skills that every Data Scientist should have and all these tasks will be touched upon later in this course. 
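As a small preview of three of the tasks listed (data fill, merges, and statistical analysis), here is a sketch on hypothetical data. The names, figures, and column names are made up for illustration, and filling a gap with the column mean is only one of several possible choices.

```python
import pandas as pd

# Hypothetical data: one sales figure is missing.
sales = pd.DataFrame({
    "name": ["Ada", "Bola", "Chidi"],
    "sales": [100.0, None, 300.0],
})
regions = pd.DataFrame({
    "name": ["Ada", "Bola", "Chidi"],
    "region": ["North", "South", "East"],
})

# Data fill: replace the missing value with the mean of the known values.
filled = sales.fillna({"sales": sales["sales"].mean()})

# Merge: combine the two tables on their shared "name" column.
merged = filled.merge(regions, on="name")

# Statistical analysis: count, mean, min, max, and more in one call.
summary = merged["sales"].describe()
print(merged)
print(summary)
```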

For now, we demonstrate how Pandas is able to load data from various sources.

**Data from CSV**

In [5]:
import pandas as pd

pd.read_csv("./working_with_tables.csv")

Unnamed: 0,Firstname,Lastname,age,sales,cost
0,Olanrewaju,Kazeem,[24],15000,11000
1,Chibuzo,Ekenne,[30],25800,15200
2,Onyinyechi,Amos,[22],45000,28000
3,Orlando,Bloom,[25],28000,22000
4,Alex,Iwobi,[34],36500,18850
5,Kachi,Felix,[50],44235,33218
6,Olujare,Dada,[28],27500,25800
7,Tomiwa,Sogaolu,[42],50509,34850
8,Timilehin,Kupolokun,[49],88390,76001
9,Nnena,Dickson,[39],122200,85900
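Not every CSV file uses a comma between values. The sketch below reads semicolon-separated text through the `sep` parameter of `pd.read_csv`. To stay self-contained, it reads from a string wrapped in `io.StringIO` instead of a file on disk, using two rows borrowed from the table above.

```python
import io
import pandas as pd

# Two rows from the table above, separated by semicolons instead of commas.
csv_text = (
    "Firstname;Lastname;sales;cost\n"
    "Olanrewaju;Kazeem;15000;11000\n"
    "Chibuzo;Ekenne;25800;15200\n"
)

# io.StringIO lets pandas treat the string like an open file.
people = pd.read_csv(io.StringIO(csv_text), sep=";")

print(list(people.columns))   # ['Firstname', 'Lastname', 'sales', 'cost']
print(people["sales"].sum())  # 40800
```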


**Data from EXCEL**

In [3]:
pd.read_excel("./data_science_people.xlsx")

Unnamed: 0,Name,Age,Favorite_food
0,Olujare,27,Beans
1,Fiyin,18,Yam
2,Priscillia,12,Rice


**Data from HTML**

In [13]:
#This example captures a table found on Wikipedia. 
#Generally, pd.read_html is used to scrape data off the internet. Web scraping is something we shall discuss later in the course
t = pd.read_html("https://en.wikipedia.org/wiki/UEFA_Champions_League")

#Taking the length of the result that pandas found on the Wikipedia page, we find that it had 26 results.
len(t)

26

In [14]:
#Viewing these results to understand them and to pinpoint what we want.

t

[                                0  \
 0                             NaN   
 1                 Organising body   
 2                         Founded   
 3                          Region   
 4                 Number of teams   
 5                   Qualifier for   
 6            Related competitions   
 7               Current champions   
 8         Most successful club(s)   
 9         Television broadcasters   
 10                        Website   
 11  2022–23 UEFA Champions League   
 
                                                     1  
 0                                                 NaN  
 1                                                UEFA  
 2               1955; 67 years ago(rebranded in 1992)  
 3                                              Europe  
 4   .mw-parser-output .plainlist ol,.mw-parser-out...  
 5                   UEFA Super CupFIFA Club World Cup  
 6   UEFA Europa League (2nd tier)UEFA Europa Confe...  
 7                            Real Madrid (14th 

In [31]:
#We can use the "match" parameter to find the exact table we are looking for. Most tables have a heading and this is what goes
#into the "match" parameter of pandas and how pandas can help us find the exact table we want.


t = pd.read_html("https://en.wikipedia.org/wiki/UEFA_Champions_League", match = "Performances in the European Cup and UEFA Champions League by club")

#Checking what result pandas brought back.
t

[   .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vte Club  \
 0                                         Real Madrid                                                                                                                          

In [33]:
#Observing the result above, we can see that the result is in a list. Remember that lists can take in any datatype.
#Therefore, to get the table out of the list, we simply apply list indexing and assign the result to a variable.

new_table = t[0]
new_table

In [34]:
#Looking at the result above, we can see the table has been produced. There is still an issue with the column name for the
#Teams. This column name can be renamed using a simple list as shown below.
#The .columns attribute in new_table.columns is used to access the column header names of a pandas dataframe. Thus, reassigning
#the column names to the names given in the list solves our problem.

new_table.columns = ["Team Name", "Title(s)", "Runners-up", "Season won", "Season runner-up"]

new_table

Unnamed: 0,Team Name,Title(s),Runners-up,Season won,Season runner-up
0,Real Madrid,14,3,"1956, 1957, 1958, 1959, 1960, 1966, 1998, 2000...","1962, 1964, 1981"
1,Milan,7,4,"1963, 1969, 1989, 1990, 1994, 2003, 2007","1958, 1993, 1995, 2005"
2,Bayern Munich,6,5,"1974, 1975, 1976, 2001, 2013, 2020","1982, 1987, 1999, 2010, 2012"
3,Liverpool,6,4,"1977, 1978, 1981, 1984, 2005, 2019","1985, 2007, 2018, 2022"
4,Barcelona,5,3,"1992, 2006, 2009, 2011, 2015","1961, 1986, 1994"
5,Ajax,4,2,"1971, 1972, 1973, 1995","1969, 1996"
6,Manchester United,3,2,"1968, 1999, 2008","2009, 2011"
7,Inter Milan,3,2,"1964, 1965, 2010","1967, 1972"
8,Juventus,2,7,"1985, 1996","1973, 1983, 1997, 1998, 2003, 2015, 2017"
9,Benfica,2,5,"1961, 1962","1963, 1965, 1968, 1988, 1990"
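Assigning a list to `.columns` replaces every header at once, so the list must name all the columns in order. When only one header is wrong, `rename` is an alternative. The sketch below compares the two on a tiny stand-in table. The header "vte Club" is a simplified placeholder for the messy column name above, not the real one.

```python
import pandas as pd

# A tiny stand-in for the scraped table; "vte Club" is a placeholder header.
table = pd.DataFrame({
    "vte Club": ["Real Madrid", "Milan"],
    "Titles": [14, 7],
})

# rename() changes only the headers named in the mapping
# and returns a new DataFrame, leaving the original untouched.
renamed = table.rename(columns={"vte Club": "Team Name"})
print(list(renamed.columns))  # ['Team Name', 'Titles']

# Assigning to .columns replaces all headers in place.
table.columns = ["Team Name", "Title(s)"]
print(list(table.columns))    # ['Team Name', 'Title(s)']
```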


**Data from Clipboard**


The table used for this example can be found on the website given below:

https://stackoverflow.com/questions/62318682/get-pandas-datframe-values-by-key

In [6]:
pd.read_clipboard()

Unnamed: 0,id,Name,subject_id,Marks_scored,Rank
0,1,Alex,sub1,98,1
1,2,Amy,sub2,90,1
2,3,Allen,sub3,87,2
3,4,Alice,sub4,69,10
4,5,Ayoung,sub5,78,7


### How to make CSV files

This is an interactive class
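For reference outside the class, one way to make a CSV file from Python is the `to_csv` method of a DataFrame. This is a minimal sketch, not necessarily the approach used in the session. It writes to an in-memory buffer so that it runs anywhere; passing a file path such as "people.csv" (a hypothetical name) instead of the buffer would write a file to disk.

```python
import io
import pandas as pd

people = pd.DataFrame({"Name": ["Olujare", "Fiyin"], "Age": [27, 18]})

# sep chooses the separator; index=False leaves out the row numbers.
buffer = io.StringIO()
people.to_csv(buffer, sep="|", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back requires the same separator.
restored = pd.read_csv(io.StringIO(csv_text), sep="|")
print(restored.equals(people))  # True
```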

#### Homework


1. Create 5 CSV files with separators different from the comma (,).

2. Get the data for the league table in the English Premier League after the 2020/21 season. The "Pos" or "Position" column is not well formatted. We need only the final position at the end of the season. The data can be found in the link given below:
https://www.premierleague.com/history/season-reviews/363

3. It is reported that the ratio of a team's Goal Difference (GD) in that season to the number of games it played (Pl) reflects the final position of that team at the end of the season. Find this ratio for all the teams and confirm whether the assertion that this ratio determines the teams' final positions is true or not.

4. If the table were ranked based on the ratio calculated above, instead of number of points (Pts), create a dataframe for this new table. It should have the following column headers: "Position (POS)", "Club", "Games Played (Pl)", "Goal Difference (GD)", "Net Goal per Game (Ratio)".