
# **Python**

# **NumPy**
In this assignment, you will get familiar with the NumPy library and learn how vectorization speeds up computations compared to iterative approaches.

Write or modify code only between consecutive `# < START >` and `# < END >` comments. DO NOT modify other parts of the notebook; otherwise, your assignment will not be graded.
```python
"Don't modify any code here"

# < START >
"YOUR CODE GOES HERE!"
# < END >

"Don't modify any code here"
```


**Start by running the cell below to import the NumPy library.**

In [None]:
import numpy as np

### **Initializing Arrays**
NumPy offers multiple methods to create and populate arrays.
- Create a $2\times3$ array identical to
$\begin{bmatrix}
1 & 2 & 4\\
7 & 13 & 21\\
\end{bmatrix}$, and assign it to a variable `arr`.  

In [None]:
# < START >

# < END >

print(arr)
print("Shape:", arr.shape)

You should be able to see that the `shape` property of an array lets you access its dimensions.  
For us, this is a handy way to ensure that the dimensions of an array are what we expect, allowing us to easily debug programs.

- Initialize a NumPy array `x` of dimensions $2\times3$ with random values.  
Do not hard-code the dimensions; instead, pass the provided variables as arguments.
<details>
  <summary>Hint</summary>
  <a href="https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html#numpy-random-randn">np.random.randn()</a>
</details>

In [None]:
n_rows = 2
n_columns = 3

# <START>

# <END>

print(x)

A few more basic methods to initialize arrays exist.
Feel free to read up online to complete the code snippets.
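
Beyond the all-zeros and all-ones arrays asked for below, a few other constructors are worth knowing. The following is a small sketch, not part of the graded assignment; the variable names and values are made up for illustration.

```python
import numpy as np

# np.full: every element is set to the same fill value (7 is arbitrary)
sevens = np.full((2, 3), 7)

# np.arange: evenly spaced integers; the stop value is excluded
counts = np.arange(0, 10, 2)       # [0 2 4 6 8]

# np.linspace: a fixed number of evenly spaced points; both endpoints included
grid = np.linspace(0.0, 1.0, 5)    # [0.   0.25 0.5  0.75 1.  ]

# np.eye: identity matrix
identity = np.eye(3)

print(sevens.shape, counts, grid, identity.shape)
```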

In [None]:
# < START >
# Initialize an array ZERO_ARR of dimensions (4, 5, 2) whose every element is 0

# < END >

print(ZERO_ARR)

# < START >
# Initialize an array ONE_ARR of dimensions (4, 5, 2) whose every element is 1

# < END >

print(ONE_ARR)

You can also transpose arrays with `.T` (just as with matrices), but a more general and commonly used tool is the `array.reshape()` method.

$$
\begin{bmatrix}
a & d\\
b & e\\
c & f\\
\end{bmatrix}
\xleftarrow{\text{.T}}
\begin{bmatrix}
a & b & c\\
d & e & f\\
\end{bmatrix}
\xrightarrow{\text{.reshape(3, 2)}}
\begin{bmatrix}
a & b\\
c & d\\
e & f\\
\end{bmatrix}
\xrightarrow{\text{.reshape(6,1)}}
\begin{bmatrix}
a\\b\\c\\d\\e\\f\\
\end{bmatrix}
$$
`reshape` is commonly used to flatten data stored in multi-dimensional arrays (e.g., a 2D array representing a black-and-white image).
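
The difference between transposing and reshaping can be checked on a small numeric array. This is an illustrative sketch using a made-up array `m`; it mirrors the diagram above with the numbers 0 to 5 in place of the letters.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)    # [[0 1 2], [3 4 5]]

transposed = m.T                  # rows and columns are swapped
reshaped = m.reshape(3, 2)        # same row-major element order, new shape
column = m.reshape(-1, 1)         # -1 asks NumPy to infer that dimension (6 here)

print(transposed)    # [[0 3] [1 4] [2 5]]
print(reshaped)      # [[0 1] [2 3] [4 5]]
print(column.shape)  # (6, 1)
```

Note that `m.T` and `m.reshape(3, 2)` have the same shape but different contents, which is a common source of bugs.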

- Try it out yourself:

In [None]:
y = np.array([[1, 2, 3],
              [4, 5, 6]])

# < START >
# Create a new array y_transpose that is the transpose of matrix y

# < END >

print(y_transpose)

# < START >
# Create a new array y_flat that contains the same elements as y but has been flattened to a column array

# < END >

print(y_flat)

- Create an array `y` with dimensions $3\times1$ (a column matrix), with elements $4$, $7$ and $11$.  
$$y = \begin{bmatrix}
4\\
7\\
11
\end{bmatrix}$$  

In [None]:
# <START>
# Initialize the column matrix here

# <END>

assert y.shape == (3, 1)
# The above line is an assert statement, which halts the program if the given condition evaluates to False.
# Assert statements are frequently used in neural network programs to ensure our matrices are of the right dimensions.

print(y)

# <START>
# Matrix-multiply a 2x3 array from above (e.g. x) with y, and assign the result to z

# <END>

assert z.shape == (2, 1)

print(z)
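
The shape assertion above relies on the matrix-multiplication rule that an $m\times n$ array times an $n\times p$ array gives an $m\times p$ result. A minimal sketch of that rule, using made-up arrays `a` and `b` that are not part of the assignment:

```python
import numpy as np

a = np.ones((3, 4))               # 3 rows, 4 columns
b = np.arange(4).reshape(4, 1)    # 4 rows, 1 column: [[0], [1], [2], [3]]

# The @ operator performs matrix multiplication (same as np.matmul(a, b))
product = a @ b

# Inner dimensions (4 and 4) must match; the result keeps the outer ones
assert product.shape == (3, 1)

# Every row of a is all ones, so each entry is 0 + 1 + 2 + 3 = 6
print(product)
```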

### **Indexing & Slicing**
Just as with nested Python lists, you can access an element at the `(i,j)` position using `array[i][j]`.  
However, NumPy allows you to do the same using `array[i, j]`, and this form is more efficient and simpler to use.
<details>
<summary><i>(Optional) Why is it more efficient?</i></summary>
The former is less efficient because a new temporary array is created after the first index i, which is then indexed by j.
</details>

```python
x = np.array([[1,3,5],[4,7,11],[5,10,20]])

x[1][2] #11
x[1,2]  #11 <-- Prefer this
```

Slicing is another important feature of NumPy arrays. The syntax is the same as for slicing Python lists. We pass the slice as:

```python
  sliced_array = array[start:end:step]
  # The second colon (:) is only needed
  # if you want to use a step other than 1
```
By default, `start` is 0, `end` is the array length (in that dimension), and `step` is 1.
Remember that `end` is not included in the slice.
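
The defaults described above can be seen on a small array. This sketch uses a made-up array `v`, separate from the exercise arrays that follow.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60])

middle = v[1:4]     # [20 30 40]  index 4 is excluded
head = v[:3]        # [10 20 30]  start defaults to 0
tail = v[3:]        # [40 50 60]  end defaults to the array length
stepped = v[1::2]   # [20 40 60]  every second element, starting at index 1

print(middle, head, tail, stepped)
```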

Implement array slicing as instructed in the following examples.



In [None]:
x = np.array([4, 1, 5, 6, 11])

# <START>
# Create a new array y with the middle 3 elements of x

# <END>

print(y)

z = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# <START>
# Create a new array w with alternating elements of z

# <END>

print(w)

A combination of indexing and slicing can be used to access rows, columns and sub-arrays of 2D arrays.

```python
arr = np.array([
          [1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])

print(arr[0])       #[1 2 3]
print(arr[:,2])     #[3 6 9]
print(arr[0:2,0:2]) #[[1 2]
                    # [4 5]]
```

In [None]:
arr_2d = np.array([[4, 5, 2],
                   [3, 7, 9],
                   [1, 4, 5],
                   [6, 6, 1]])

# <START>
# Create a 2D array sliced_arr_2d that is of the form [[5, 2], [7, 9], [4, 5]]

# <END>

print(sliced_arr_2d)

### **Broadcasting**

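Broadcasting lets NumPy apply element-wise operations to arrays of different but compatible shapes (for example, an array and a scalar, or a $2\times3$ array and a $2\times1$ array) by virtually stretching the smaller operand instead of copying its data. This makes array operations flexible and lets us implement efficient algorithms with minimal memory use.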
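
As a rough illustration of the rule NumPy applies, the sketch below adds a column and a row of different shapes. The arrays `col` and `row` are made up for this example and are unrelated to the exercise arrays.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# Shapes are compared from the right. A dimension of size 1 (or a missing
# dimension) is stretched to match the other operand, so (3, 1) and (4,)
# combine into a result of shape (3, 4).
table = col + row

print(table.shape)  # (3, 4)
print(table)
```

Shapes that differ in a dimension where neither size is 1 (for example `(3, 4)` and `(3,)`) are not compatible and raise a `ValueError`.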

In [None]:
arr1 = np.array([1, 2, 3, 4])
b = 1

# <START>
# Implement broadcasting to add b to each element of arr1

# <END>

print(arr1)

arr2 = np.array([[1, 2, 3],
                 [4, 5, 6]])
arr3 = np.array([[4],
                 [5]])

# <START>
# Multiply each element of the first row of arr2 by 4 and each element of the second row by 5, using only arr2 and arr3

# <END>

print(arr2)

### **Vectorization**

From what we've covered so far, it might not be clear why we need vectorization. To understand this, let's compare the execution times of a non-vectorized program and a vectorized one.

Your goal is to multiply each element of the 2D arrays by 3. Implement this using both non-vectorized and vectorized approaches.

In [None]:
import time

arr_nonvectorized = np.random.rand(1000, 1000)
arr_vectorized = np.array(arr_nonvectorized) # making a deep copy of the array

start_nv = time.time()

# Non-vectorized approach
# <START>



# <END>

end_nv = time.time()
print("Time taken in non-vectorized approach:", 1000*(end_nv-start_nv), "ms")

# uncomment and execute the below line to convince yourself that both approaches are doing the same thing
# print(arr_nonvectorized)

start_v = time.time()

# Vectorized approach
# <START>

# <END>

end_v = time.time()
print("Time taken in vectorized approach:", 1000*(end_v-start_v), "ms")

# uncomment and execute the below line to convince yourself that both approaches are doing the same thing
# print(arr_vectorized)

Try playing around with the dimensions of the array. You'll find that there isn't much difference in execution times when the dimensions are small. In neural networks, however, we often deal with very large datasets, so vectorization is an essential tool.
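
The same comparison can be repeated for other element-wise operations. The sketch below squares every element instead of multiplying by 3, and it uses `time.perf_counter()`, which is generally better suited to measuring short intervals than `time.time()`. The array size and variable names are arbitrary choices, and the timings will vary from machine to machine.

```python
import time
import numpy as np

data = np.random.rand(200, 200)

# Loop version: every element is visited from Python
start = time.perf_counter()
looped = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        looped[i, j] = data[i, j] ** 2
loop_ms = 1000 * (time.perf_counter() - start)

# Vectorized version: one expression over the whole array
start = time.perf_counter()
vectorized = data ** 2
vec_ms = 1000 * (time.perf_counter() - start)

# Both approaches must produce the same values
assert np.allclose(looped, vectorized)
print(f"loop: {loop_ms:.2f} ms, vectorized: {vec_ms:.2f} ms")
```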

# **Pandas**

This section will help you get familiar with various pandas functions and how to use them.

As you go through this section, you will find `???` in certain places. To complete this section, you must replace every `???` with an appropriate value, expression or statement so that the notebook runs properly end-to-end.


Let's start by importing the pandas library.

In [None]:
import pandas as pd

### **Loading data from a URL**

In [6]:
url = "https://raw.githubusercontent.com/Mehul-Agrawal410/SpeakSpeare/main/Week%201/countries.csv"

df = pd.read_csv(url, decimal=',')
df

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.00,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.60,20.34,0.380,0.240,0.380
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.00,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.600,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.00,15.00,75.00,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.00,6.60,4.05,19000.0,100.0,497.2,2.22,0.00,97.78,3.0,8.71,6.25,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,West Bank,NEAR EAST,2460492,5860,419.9,0.00,2.98,19.62,800.0,,145.2,16.90,18.97,64.13,3.0,31.67,3.92,0.090,0.280,0.630
223,Western Sahara,NORTHERN AFRICA,273008,266000,1.0,0.42,,,,,,0.02,0.00,99.98,1.0,,,,,0.400
224,Yemen,NEAR EAST,21456188,527970,40.6,0.36,0.00,61.50,800.0,50.2,37.2,2.78,0.24,96.98,1.0,42.89,8.30,0.135,0.472,0.393
225,Zambia,SUB-SAHARAN AFRICA,11502010,752614,15.3,0.00,0.00,88.29,800.0,80.6,8.2,7.08,0.03,92.90,2.0,41.00,19.93,0.220,0.290,0.489


*(Optional)*: You can rename the columns for your convenience later on. Look up the syntax for renaming columns.
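
One common way to rename columns is `DataFrame.rename` with a `columns` mapping. The sketch below runs on a tiny made-up dataframe (`toy_df`), and the new names are arbitrary choices, not names the rest of this notebook expects.

```python
import pandas as pd

# A made-up two-column frame that borrows two of the long column names
toy_df = pd.DataFrame({
    "Area (sq. mi.)": [100, 250],
    "GDP ($ per capita)": [700.0, 4500.0],
})

# rename returns a new dataframe; toy_df itself is left unchanged
renamed = toy_df.rename(columns={
    "Area (sq. mi.)": "area",
    "GDP ($ per capita)": "gdp_per_capita",
})

print(renamed.columns.tolist())  # ['area', 'gdp_per_capita']
```

If you do rename columns in `df`, remember to use the new names in the questions that follow.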

**Q1: How many countries does the dataframe contain?**

*Hint: Use the `.shape` attribute.*

In [None]:
num_countries = ???
print(f'There are {num_countries} countries in the dataset')

**Q2: Retrieve a list of the regions from the dataframe.**

*Hint: Use the `.unique` method of a series.*

In [None]:
continents = ???
continents

**Q3: What is the total population of all the countries listed in this dataset?**

In [None]:
total_population = ???
print(f'The total population is {total_population}.')

**Q4: Create a dataframe containing 10 countries with the highest population.**

*Hint: Chain the `sort_values` and `head` methods.*

In [None]:
most_populous_df = ???
most_populous_df

**Q5: Add a new column in `df` to record the overall GDP per country (product of population & per capita GDP).**

In [None]:
df['gdp'] = ???
df

**Q6: Create a data frame that counts the number of countries in each region.**

*Hint: Use `groupby`, select the `Country` column and aggregate using `count`.*
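
The group, select, aggregate pattern from the hint can be tried on a toy example first. The dataframe below (`toy_df`, with made-up `city`, `state` and `residents` columns) is purely illustrative and unrelated to the countries dataset.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "state": ["X", "X", "Y", "X"],
    "residents": [10, 20, 30, 40],
})

# Group rows by state, pick one column, then aggregate it
cities_per_state = toy_df.groupby("state")["city"].count()
residents_per_state = toy_df.groupby("state")["residents"].sum()

print(cities_per_state)      # X -> 3, Y -> 1
print(residents_per_state)   # X -> 70, Y -> 30
```

Selecting a single column yields a Series indexed by group; calling `.to_frame()` or `.reset_index()` on it gives a dataframe if one is needed.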

In [None]:
country_counts_df = ???
country_counts_df

**Q7: Create a data frame showing the total population of each region.**

*Hint: Use `groupby`, select the `Population` column and aggregate using `sum`.*

In [None]:
region_populations_df = ???
region_populations_df

**Q8: Count the number of null values in each column.**

*Hint: Use `isna`.*

In [10]:
na_values = df.isna().sum()
na_values

Country                                0
Region                                 0
Population                             0
Area (sq. mi.)                         0
Pop. Density (per sq. mi.)             0
Coastline (coast/area ratio)           0
Net migration                          3
Infant mortality (per 1000 births)     3
GDP ($ per capita)                     1
Literacy (%)                          18
Phones (per 1000)                      4
Arable (%)                             2
Crops (%)                              2
Other (%)                              2
Climate                               22
Birthrate                              3
Deathrate                              4
Agriculture                           15
Industry                              16
Service                               15
dtype: int64

**Q9: Fill all the null values with the mean of their respective column.**

*Hint: Use `mean` and `fillna`.*
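
A mean only makes sense for numeric columns, and in recent pandas versions calling `mean()` on a dataframe that also has text columns raises an error unless `numeric_only=True` is passed. The sketch below shows the idea on a made-up dataframe (`toy_df`); it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["a", "b", "c"],          # text column, left untouched
    "score": [1.0, np.nan, 3.0],
    "weight": [np.nan, 4.0, 8.0],
})

# Per-column means of the numeric columns only (NaN values are skipped)
column_means = toy_df.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value whose
# index label matches that column's name
filled = toy_df.fillna(column_means)

print(filled)
```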

In [None]:
# <START>

# <END>

**Q10: Create a dataframe containing 10 countries with the lowest GDP per capita, among the countries with population greater than 100 million.**

In [None]:
# <START>

# <END>

# **Matplotlib**
