## **Introduction to DAV (Data Analysis and Visualization) Module**

With this lecture, we're starting the DAV module.

It will contain 3 sections -

1. DAV-1: Python Libraries
 - Numpy
 - Pandas
 - Matplotlib & Seaborn
2. DAV-2: Probability Statistics
3. DAV-3: Hypothesis Testing

# Numpy - 1

## **Content**

- Introduction to DAV
- Python Lists vs Numpy Array
  - Importing Numpy
  - Why use Numpy?
- Dimension & Shape
- Type Conversion in Numpy Arrays
- Indexing & Slicing
- NPS use case

## Python Lists vs Numpy

### **Homogeneity of data**

So far, we've been working with Python lists, which can hold **heterogeneous data**.

In [None]:
a = [1, 2, 3, "Michael", True]
a

[1, 2, 3, 'Michael', True]

Because of this heterogeneity, the elements of a Python list are not stored together in memory (RAM).

- Each element is stored in a different location.
- Only the addresses of the elements are stored together.
- So, a list just holds references to these different locations and follows them to access the actual elements.

\
On the other hand, Numpy only stores **homogeneous data**, i.e. a Numpy array cannot contain mixed data types.

It will either
- ONLY contain integers
- ONLY contain floats
- ONLY contain characters

... and so on.

Because of this, we can now store these different data items together, as they are of the same type.
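The difference in storage can be observed directly. The sketch below is only an illustration: the variable names `mixed` and `packed` are made up for this example, and `dtype=np.int64` is set explicitly so that the byte counts are the same on every platform.

```python
import numpy as np

# A list can mix types because it only stores references to separate objects.
mixed = [1, 2, 3, "Michael", True]
print([type(x).__name__ for x in mixed])  # ['int', 'int', 'int', 'str', 'bool']

# A Numpy array stores raw values of a single type side by side.
packed = np.array([1, 2, 3, 4, 5], dtype=np.int64)
print(packed.dtype)     # int64
print(packed.itemsize)  # 8 (bytes per element)
print(packed.nbytes)    # 40 (5 elements * 8 bytes)
```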

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/995/original/download.png?1706870327" width=700 height=175>

### **Speed**

Programming languages also differ in how fast they run.

In general,
- Java is a decently fast language.
- Python is a slow language.
- C, one of the earliest available languages, is super fast.

This is largely because C is compiled to machine code and gives direct control over memory through concepts like memory allocation, pointers, etc.

#### **How is this possible?**

With Numpy, we write our code in Python, but behind the scenes the core routines are implemented in the **C programming language**, which makes them faster.

Because of this, a Numpy array will typically be significantly faster than a Python list at performing the same operation.

This is very important to us, because in data science we deal with huge amounts of data.


### **Properties**

- **In-built Functions**
 - For a Python list `a`, we had in-built functions like `sum(a)`, etc.
 - For NumPy arrays also, we will have such in-built functions.

- **Slicing**
 - Recall that we were able to perform list slicing.
 - All of that is still applicable here.
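As a quick illustration of both properties, here is a minimal sketch. The array `scores` and its values are made up for this example.

```python
import numpy as np

scores = np.array([4, 8, 15, 16, 23, 42])

# In-built functions are available as methods on the array.
print(scores.sum())   # 108
print(scores.max())   # 42
print(scores.mean())  # 18.0

# Slicing uses the same start:stop:step syntax as Python lists.
print(scores[1:4])    # [ 8 15 16]
print(scores[::2])    # [ 4 15 23]
```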


Recall how we used to import a module/library in Python.

* In order to use Python Lists, we do not need to import anything extra.
* However, to use Numpy arrays, we need to import Numpy into our environment, as it is a library.

Generally, we do so while using the alias **`np`**.

In [None]:
import numpy as np

**Note:**
- In this notebook, Numpy is already installed, as we are working on Google Colab.
- However, when working in an environment that does not have it installed, you'll have to install it once before first use.
- This can be done with the command: `!pip install numpy`

## **Why use Numpy? - Time Comparison**

In [None]:
a = [1,2,3,4,5]

In [None]:
type(a)

list

Suppose we want to square every element of this list.

The basic approach here would be to iterate over the list and square each element.

In [None]:
res=[]
for i in a:
  res.append(i**2)
print(res)

[1, 4, 9, 16, 25]


Now let's do the same thing with Numpy. To do so, we first need to define a Numpy array.

We can convert any list `a` into a Numpy array using the `array()` function.

In [None]:
b = np.array(a)
b

array([1, 2, 3, 4, 5])

In [None]:
type(b)

numpy.ndarray

- `nd` in `numpy.ndarray` stands for **n-dimensional**

Now, how can we get the square of each element in the same Numpy array?

In [None]:
b**2

array([ 1,  4,  9, 16, 25])

In [None]:
a=np.array([1,15,20])

In [None]:
a**3

array([   1, 3375, 8000])

In [None]:
a

array([ 1, 15, 20])

In [None]:
type(a)

numpy.ndarray

In [None]:
a=np.array(2)

In [None]:
a

array(2)

In [None]:
ab=np.array(['Hi','Ji'])

In [None]:
type(ab)

numpy.ndarray

In [None]:
arr=np.array([1,10,13,73,1387])

In [None]:
arr+100

array([ 101,  110,  113,  173, 1487])

In [None]:
arr//2

array([  0,   5,   6,  36, 693])

**The biggest benefit of Numpy is that it supports element-wise operation.**

Notice how easy and clean the syntax is.
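To see why element-wise operation matters, compare what the same operators do on Python lists. This is a small sketch with made-up names (`x_list`, `y_list`, `x`, `y`).

```python
import numpy as np

x_list = [1, 2, 3]
y_list = [10, 20, 30]

# On lists, + concatenates and * repeats.
print(x_list + y_list)  # [1, 2, 3, 10, 20, 30]
print(x_list * 2)       # [1, 2, 3, 1, 2, 3]

x = np.array(x_list)
y = np.array(y_list)

# On Numpy arrays, the same operators work element by element.
print(x + y)  # [11 22 33]
print(x * 2)  # [2 4 6]
print(x * y)  # [10 40 90]
```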

**What is the major reason behind numpy's faster computation?**

- A Numpy array is densely packed in memory due to its **homogeneous** type.
- Numpy functions are implemented in the **C programming language**.
- Many Numpy operations are vectorized: they run as compiled loops over the whole array and can use SIMD instructions to process several elements **in parallel.**
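The speed difference can be measured with the standard `timeit` module. The sketch below is one possible way to do it; the size `n`, the helper function names and the repetition count are arbitrary choices for this example, and the actual timings will vary from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
values_list = list(range(n))
values_arr = np.arange(n, dtype=np.int64)  # int64 so the squares cannot overflow

def square_with_loop():
    # Pure Python: one interpreted step per element.
    return [v ** 2 for v in values_list]

def square_with_numpy():
    # Numpy: a single vectorized operation over the whole array.
    return values_arr ** 2

loop_time = timeit.timeit(square_with_loop, number=10)
numpy_time = timeit.timeit(square_with_numpy, number=10)
print(f"list loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s")
```

On a typical machine the Numpy version should finish noticeably faster, though the exact ratio depends on the hardware and the array size.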

In [None]:
np.arange(1,100000000*10,2)

array([        1,         3,         5, ..., 999999995, 999999997,
       999999999])

In [None]:
np.arange(0,5,0.8)

array([0. , 0.8, 1.6, 2.4, 3.2, 4. , 4.8])

## **Dimensions and Shape**

**We can get the dimension of an array using the `ndim` property.**

In [None]:
arr1=np.arange(100000)
arr1.ndim

1

In [None]:
arr1 #1D Array

array([    0,     1,     2, ..., 99997, 99998, 99999])

In [None]:
arr2=np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
arr2.ndim #2D Array

2

**Numpy arrays have another property called `shape` that tells us the number of elements along every dimension.**

In [None]:
arr1.shape

(100000,)

In [None]:
arr2.shape

(4, 3)

This means that the array `arr1` has 100000 elements in a single dimension, while `arr2` has 4 rows and 3 columns.

Let's take another example to understand `shape` and `ndim` better.

In [None]:
arr2 = np.array([[1, 2, 3], [4, 5, 6], [10, 11, 12]])
print(arr2)

[[ 1  2  3]
 [ 4  5  6]
 [10 11 12]]


**What do you think will be the shape & dimension of this array?**

In [None]:
arr2.ndim

2

In [None]:
arr2.shape

(3, 3)

`ndim` specifies the number of dimensions of the array i.e. 1D (1), 2D (2), 3D (3) and so on.

`shape` returns the exact shape in all dimensions, here (3, 3), which means 3 elements along axis 0 and 3 elements along axis 1.
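The same two properties extend to other numbers of dimensions. As a small sketch (the names `scalar` and `cube` are made up, and `np.zeros` is used only to get an array of a given shape quickly):

```python
import numpy as np

scalar = np.array(2)         # a single value, no axes at all
cube = np.zeros((2, 3, 4))   # 3 axes with 2, 3 and 4 elements

print(scalar.ndim, scalar.shape)  # 0 ()
print(cube.ndim, cube.shape)      # 3 (2, 3, 4)
print(cube.size)                  # 24 (total number of elements: 2 * 3 * 4)
```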

<img src="https://drive.google.com/uc?id=1GSV_E1CaCc_Ur7pWJ-Kqv0VKvBRwByR1">

### **`np.arange()`**

Let's create some sequences in Numpy.

We can pass **starting** point, **ending** point (not included in the array) and **step-size**.

**Syntax:**
- `arange(start, end, step)`

In [None]:
np.arange(1,15,3)

array([ 1,  4,  7, 10, 13])

- `np.arange()` behaves in the same way as the `range()` function.
- Unlike `range()`, `np.arange()` also accepts a **floating point number** as the **step-size**.
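A short sketch of both points, using arbitrary example values:

```python
import numpy as np

# Same sequence from range() and np.arange() for integer arguments.
print(list(range(1, 15, 3)))  # [1, 4, 7, 10, 13]
print(np.arange(1, 15, 3))    # [ 1  4  7 10 13]

# A float step works in np.arange() ...
print(np.arange(0, 1, 0.25))  # [0.   0.25 0.5  0.75]

# ... but range() only accepts integers.
try:
    range(0, 1, 0.25)
except TypeError as err:
    print("range() failed:", err)
```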

## **Type Conversion in Numpy Arrays**

For this, let's pass a **float** as one of the values in a **numpy array**.

In [None]:
arr4 = np.array([1, 2, 3, 4])
arr4

array([1, 2, 3, 4])

In [None]:
arr4 = np.array([1, 2, 3, 4.0])
arr4

array([1., 2., 3., 4.])

In [None]:
arr=np.array([1.0,2.0,3])
arr

array([1., 2., 3.])

- Notice that **`int` is promoted (upcast) to `float`**
- This is because a Numpy array can only store **homogeneous data**, i.e. values of one data type.
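The same promotion happens with other type mixes. In this sketch (example values are made up), mixing numbers with a string turns every element into a string:

```python
import numpy as np

ints_and_float = np.array([1, 2, 3.5])
ints_and_str = np.array([1, 2, "three"])

print(ints_and_float.dtype)     # float64
print(ints_and_str.dtype.kind)  # U (a unicode string type)
print(ints_and_str)             # ['1' '2' 'three']
```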

There's a `dtype` parameter in the `np.array()` function.

**What if we set the `dtype` of array containing `integer` values to `float`?**

In [None]:
arr5 = np.array([1, 2, 3, 4])
arr5

array([1, 2, 3, 4])

In [None]:
arr5 = np.array([1, 2, 3, 4], dtype="float")
arr5

array([1., 2., 3., 4.])

**Question:** What will happen if we set `dtype="float"` for an array that contains strings of alphabets?

Since it is not possible to convert strings of alphabets to floats, it will naturally raise an error.
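A minimal sketch of this behaviour is shown below. The strings are placeholder values chosen for this example, not taken from the original lecture code.

```python
import numpy as np

# Alphabetic strings cannot be interpreted as floats, so Numpy raises ValueError.
try:
    np.array(["Harry", "Ron", "Hermione"], dtype="float")
except ValueError as err:
    print("Conversion failed:", err)

# Strings that look like numbers can be converted.
numeric_strings = np.array(["1.5", "2", "3"], dtype="float")
print(numeric_strings.dtype)  # float64
```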

\
We can also convert the data type with the `astype()` method.

In [None]:
arr = np.array([10, 20, 30, 40, 50])
arr

array([10, 20, 30, 40, 50])

In [None]:
arr = arr.astype('float64')
print(arr)

[10. 20. 30. 40. 50.]
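Two details of `astype()` are worth keeping in mind: it returns a new array and leaves the original untouched, and converting floats to integers drops the fractional part. A small sketch with made-up values:

```python
import numpy as np

original = np.array([10, 20, 30])
as_float = original.astype("float64")

print(original.dtype.kind)  # i (still an integer array)
print(as_float.dtype)       # float64

# Float to int conversion truncates towards zero instead of rounding.
prices = np.array([1.9, 2.5, -3.7])
print(prices.astype("int64"))  # [ 1  2 -3]
```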


## **Indexing**

- Similar to Python lists

In [None]:
#Positive Indexing

In [None]:
arr=np.arange(41,76,3)
arr

array([41, 44, 47, 50, 53, 56, 59, 62, 65, 68, 71, 74])

In [None]:
arr[0]

41

In [None]:
#Negative Indexing

In [None]:
arr[-1]

74

In [None]:
arr[len(arr)-1]

74

You can also pass a list of indices to a Numpy array.

A single index can be repeated multiple times in the list of indices.

**Note:**
- If you want to extract multiple indices, you need to use two sets of square brackets `[[ ]]`, i.e. pass the indices as a list.
- Otherwise, you will get an error, because a 1D array expects only a single index inside one set of brackets.
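For instance, with the array defined above (recreated here so the snippet runs on its own; the chosen indices are arbitrary):

```python
import numpy as np

arr = np.arange(41, 76, 3)

# A list of indices returns the elements at those positions; index 5 is repeated.
print(arr[[1, 5, 5, 7]])  # [44 56 56 62]

# Without the inner brackets, Numpy reads 1, 5, 7 as indices into three dimensions.
try:
    arr[1, 5, 7]
except IndexError as err:
    print("IndexError:", err)
```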

## **Slicing**

In [None]:
arr=np.arange(11,18)

In [None]:
arr

array([11, 12, 13, 14, 15, 16, 17])

**Question:** What'll be the output of `arr[-5:-1]` ?

In [None]:
arr[-5:-1]

array([13, 14, 15, 16])

**Question:** What'll be the output for `arr[-5:-1: -1]` ?

In [None]:
arr[-5:-1:-1]

array([], dtype=int64)
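The result is empty because a negative step walks backwards, so the start must lie to the right of the stop. A short sketch of the difference:

```python
import numpy as np

arr = np.arange(11, 18)

# Start (-5) is already to the left of stop (-1), so stepping backwards finds nothing.
print(arr[-5:-1:-1])  # []

# Swapping start and stop walks backwards from the last element.
print(arr[-1:-5:-1])  # [17 16 15 14]

# Omitting both reverses the whole array.
print(arr[::-1])      # [17 16 15 14 13 12 11]
```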

## **Fancy Indexing (Masking)**

- Numpy arrays can be indexed with boolean arrays (masks).
- This method is called **fancy indexing** or **masking**.

\
What would happen if we do this?

In [None]:
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1 < 6

array([ True,  True,  True,  True,  True, False, False, False, False,
       False])

**Comparison operation also happens on each element**.
- All the values less than 6 return `True`
- All the values equal to or greater than 6 return `False`

**Question:** What will be the output of the following?

In [None]:
m1[[True,  True,  True,  True,  True, False, False, False, False, False]]

array([1, 2, 3, 4, 5])

In [None]:
m1[m1<6]

array([1, 2, 3, 4, 5])

Notice that we are passing a list of boolean values, one for each element.
- For every instance of `True`, it will return the element at the corresponding index.
- Conversely, for every `False`, it will skip the element at the corresponding index.

So, this becomes a **filter** of sorts.

Now, let's use this to filter or mask values from our array.

**Condition will be passed instead of indices and slice ranges.**

In [None]:
m1[m1 < 6]

array([1, 2, 3, 4, 5])

This is known as **Fancy Indexing** in Numpy.

\
**Question:** How can we filter/mask even values from our array?

In [None]:
m1[m1%2 == 0]

array([ 2,  4,  6,  8, 10])
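Conditions can also be combined. A minimal sketch, using the element-wise operators `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import numpy as np

m1 = np.arange(1, 11)

print(m1[(m1 > 3) & (m1 % 2 == 0)])  # [ 4  6  8 10]
print(m1[(m1 < 3) | (m1 > 8)])       # [ 1  2  9 10]
print(m1[~(m1 < 6)])                 # [ 6  7  8  9 10]
```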

## **Use Case: NPS (Net Promoter Score)**

#### Imagine you are a Data Analyst @ Airbnb

You've been asked to analyze user survey data and report NPS to the management.

#### But, what exactly is NPS?

Have you all seen that every month, you get a survey form from Scaler?

- This form asks you to fill in feedback regarding how you are liking the services of Scaler in terms of a numerical score.
- This is known as the **Likelihood to Recommend Survey**.
- It is widely used by different companies and service providers to evaluate their performance and customer satisfaction.

<img src="https://drive.google.com/uc?id=1-u8e-v_90JdikorKsKzBM-JJqoRtzsN8">

- Responses are given on a scale ranging from 0–10,
    - with 0 labeled with “Not at all likely,” and
    - 10 labeled with “Extremely likely.”

Based on this, we calculate the **Net Promoter Score**.

### **How to calculate NPS score?**

<img src="https://drive.google.com/uc?id=1KPIYlaN68vlL99iApaF5QbeBoyT24-Eu">

We label our responses into 3 categories:
- **Detractors**: Respondents with a score of 0-6
- **Passive**: Respondents with a score of 7-8
- **Promoters**: Respondents with a score of 9-10.

```
Net Promoter score = % Promoters - % Detractors.
```
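The formula can be sketched in a few lines of Numpy. The function name `net_promoter_score` and the ten sample responses are made up for illustration; they are not part of the survey dataset.

```python
import numpy as np

def net_promoter_score(scores):
    """Return % promoters minus % detractors for an array of 0-10 scores."""
    scores = np.asarray(scores)
    promoters = np.count_nonzero(scores >= 9)   # scores of 9 or 10
    detractors = np.count_nonzero(scores <= 6)  # scores of 0 to 6
    return 100 * (promoters - detractors) / scores.size

sample = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]
# 4 promoters (40%) and 3 detractors (30%) out of 10 responses
print(net_promoter_score(sample))  # 10.0
```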

### **Range of NPS**

- If all people are promoters (rated 9-10), we get $100$ NPS
- Conversely, if all people are detractors (rated 0-6), we get $-100$ NPS
- Also, if all people are neutral (rated 7-8), we get a $0$ NPS

Therefore, the range of NPS lies between $[-100, 100]$
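These three boundary cases can be checked with a small sketch. The helper `nps` and the response lists are hypothetical; the mean of a boolean mask gives the fraction of `True` values, which is used here as the share of promoters and detractors.

```python
import numpy as np

def nps(scores):
    scores = np.asarray(scores)
    promoter_share = (scores >= 9).mean()   # fraction of promoters
    detractor_share = (scores <= 6).mean()  # fraction of detractors
    return 100 * (promoter_share - detractor_share)

print(nps([9, 10, 10, 9]))  # 100.0  (all promoters)
print(nps([0, 3, 6, 5]))    # -100.0 (all detractors)
print(nps([7, 8, 8, 7]))    # 0.0    (all passive)
```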

\
Generally, each company targets at least a threshold NPS.
- For Scaler, this is a score of 70.
- This means that if $NPS > 70$, the company is performing great.

Naturally, this varies from business to business.

### **How is NPS helpful?**

####  Why would we want to analyse the survey data for NPS?

NPS helps a brand in gauging its brand value and sentiment in the market.

- Promoters are highly likely to recommend your product or service, hence bringing in more business.
- Detractors, on the other hand, are likely to recommend against using your product or service, hence bringing the business down.

\
These insights can help a business make customer-oriented decisions and improve its product.

**2/3 of Fortune 500 companies use NPS**

\
Even at Scaler, every month, we randomly reach out to our learners over a call, and try to understand,
- How is the overall experience for them?
- What are some things that they like?
- What do they not like?

Based on the feedback received, we sometimes end up getting really good insights, and act on them.

This will help improve the next month's NPS.

### **NPS Problem**

Let's first look at the data we have gathered.

**Dataset:** https://drive.google.com/file/d/1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK/view?usp=sharing

<img width = 500 src="https://drive.google.com/uc?id=1arJhLlzbr_Rf7ONxpkzo726mLbTyLb_p">


In [None]:
arr=[102,34,13]

In [None]:
np.array(arr)

array([102,  34,  13])

In [None]:
np.array([1,2,3,4,5,8])

In [None]:
arr2=np.arange(1,23,7)

In [None]:
arr2.ndim

1

In [None]:
arr3=np.array([[12,3],[472847,88]])

In [None]:
arr3.shape

(2, 2)

# Defining a numpy array

In [None]:
np.arange(1,100,0.5)

array([ 1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,  5.5,  6. ,
        6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. , 10.5, 11. , 11.5,
       12. , 12.5, 13. , 13.5, 14. , 14.5, 15. , 15.5, 16. , 16.5, 17. ,
       17.5, 18. , 18.5, 19. , 19.5, 20. , 20.5, 21. , 21.5, 22. , 22.5,
       23. , 23.5, 24. , 24.5, 25. , 25.5, 26. , 26.5, 27. , 27.5, 28. ,
       28.5, 29. , 29.5, 30. , 30.5, 31. , 31.5, 32. , 32.5, 33. , 33.5,
       34. , 34.5, 35. , 35.5, 36. , 36.5, 37. , 37.5, 38. , 38.5, 39. ,
       39.5, 40. , 40.5, 41. , 41.5, 42. , 42.5, 43. , 43.5, 44. , 44.5,
       45. , 45.5, 46. , 46.5, 47. , 47.5, 48. , 48.5, 49. , 49.5, 50. ,
       50.5, 51. , 51.5, 52. , 52.5, 53. , 53.5, 54. , 54.5, 55. , 55.5,
       56. , 56.5, 57. , 57.5, 58. , 58.5, 59. , 59.5, 60. , 60.5, 61. ,
       61.5, 62. , 62.5, 63. , 63.5, 64. , 64.5, 65. , 65.5, 66. , 66.5,
       67. , 67.5, 68. , 68.5, 69. , 69.5, 70. , 70.5, 71. , 71.5, 72. ,
       72.5, 73. , 73.5, 74. , 74.5, 75. , 75.5, 76

In [None]:
np.array([[1,2],[3,4]])

array([[1, 2],
       [3, 4]])

In [None]:
a=[1,2,3,4,5]

In [None]:
np.array(a)

array([1, 2, 3, 4, 5])

In [None]:
ab=np.arange(1,19,2)

In [None]:
ab.astype('float')

array([ 1.,  3.,  5.,  7.,  9., 11., 13., 15., 17.])

In [None]:
ab.dtype

dtype('int64')

In [None]:
np.array(ab,dtype='float')

array([ 1.,  3.,  5.,  7.,  9., 11., 13., 15., 17.])

In [None]:
ac=np.arange(15,109,6)

In [None]:
ac.dtype

dtype('int64')

In [None]:
ac.astype('float')

array([ 15.,  21.,  27.,  33.,  39.,  45.,  51.,  57.,  63.,  69.,  75.,
        81.,  87.,  93.,  99., 105.])

In [None]:
arr=np.arange(10)

In [None]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
arr>6

array([False, False, False, False, False, False, False,  True,  True,
        True])

In [None]:
arr[arr>6]

array([7, 8, 9])

In [None]:
arr[arr%2==0]

array([0, 2, 4, 6, 8])

In [None]:
arr2=np.arange(1,99,2)
arr2

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67,
       69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97])

In [None]:
arr2[(arr2%5==0)  | (arr2%7 ==0)]

array([ 5,  7, 15, 21, 25, 35, 45, 49, 55, 63, 65, 75, 77, 85, 91, 95])

In [None]:
import numpy as np
survey = np.loadtxt('survey.txt',dtype='int')

In [None]:
type(survey)

numpy.ndarray

In [None]:
len(survey)

1167

In [None]:
survey

array([ 7, 10,  5, ...,  5,  9, 10])

# NPS Solution

In [None]:
import numpy as np #importing the library

In [None]:
data=np.loadtxt('survey.txt',dtype='int')#Loading the data

In [None]:
data

array([ 7, 10,  5, ...,  5,  9, 10])

In [None]:
promoters=data[data>=9] #Creating an array for promoters
promoters

array([10,  9,  9,  9,  9,  9, 10,  9,  9, 10,  9,  9,  9,  9,  9,  9,  9,
       10, 10,  9, 10,  9, 10,  9,  9, 10, 10,  9, 10,  9, 10, 10, 10,  9,
        9, 10, 10, 10,  9, 10,  9, 10,  9,  9,  9, 10,  9,  9,  9,  9,  9,
        9, 10,  9,  9,  9, 10,  9, 10,  9,  9,  9,  9,  9, 10, 10,  9, 10,
        9,  9, 10,  9,  9, 10,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,
       10,  9,  9,  9,  9,  9,  9,  9,  9, 10,  9,  9,  9,  9,  9, 10, 10,
        9, 10, 10, 10, 10,  9, 10,  9,  9,  9,  9,  9,  9,  9, 10,  9, 10,
        9, 10,  9, 10, 10, 10,  9,  9, 10, 10,  9,  9,  9,  9, 10,  9,  9,
        9,  9, 10, 10,  9,  9,  9, 10,  9,  9,  9,  9,  9, 10,  9,  9,  9,
       10,  9, 10, 10,  9,  9,  9,  9, 10,  9, 10,  9,  9,  9,  9, 10,  9,
       10,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10,  9,  9,  9,
       10,  9, 10, 10, 10, 10, 10,  9,  9,  9,  9, 10, 10, 10,  9,  9, 10,
        9, 10, 10, 10, 10,  9, 10, 10,  9, 10,  9,  9,  9,  9, 10,  9, 10,
        9, 10,  9,  9,  9

In [None]:
detractors=data[data<=6] #Creating array for detractors
detractors

array([5, 4, 4, 5, 1, 5, 5, 1, 4, 5, 4, 4, 4, 5, 1, 4, 1, 4, 1, 5, 5, 1,
       1, 4, 1, 5, 4, 1, 1, 4, 1, 5, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 4, 1,
       1, 5, 5, 5, 4, 4, 1, 4, 1, 4, 1, 5, 1, 1, 5, 4, 4, 4, 4, 1, 4, 5,
       4, 4, 1, 1, 5, 5, 1, 5, 1, 5, 5, 4, 5, 4, 1, 1, 1, 1, 4, 1, 4, 4,
       5, 4, 1, 1, 1, 1, 5, 4, 5, 5, 4, 1, 5, 1, 4, 4, 1, 1, 1, 4, 4, 5,
       5, 4, 5, 5, 5, 1, 4, 1, 5, 5, 1, 5, 1, 1, 5, 5, 4, 4, 1, 4, 4, 4,
       1, 1, 4, 4, 4, 5, 5, 1, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 5, 4, 4,
       5, 1, 4, 5, 5, 5, 1, 5, 4, 1, 1, 5, 5, 5, 4, 5, 4, 4, 1, 4, 4, 4,
       4, 5, 1, 5, 5, 1, 4, 4, 5, 1, 1, 4, 5, 5, 5, 1, 4, 5, 5, 4, 1, 5,
       5, 5, 1, 1, 5, 5, 1, 1, 1, 4, 5, 5, 4, 4, 4, 5, 1, 4, 1, 4, 5, 4,
       5, 5, 1, 5, 1, 5, 5, 1, 4, 5, 5, 4, 1, 5, 1, 4, 1, 4, 1, 1, 1, 1,
       1, 1, 4, 1, 5, 4, 5, 1, 5, 1, 5, 4, 4, 4, 4, 5, 5, 1, 4, 1, 5, 5,
       1, 4, 1, 1, 4, 4, 4, 4, 1, 4, 1, 1, 4, 1, 5, 4, 1, 1, 5, 4, 5, 4,
       4, 4, 1, 5, 5, 1, 4, 5, 4, 4, 4, 1, 4, 1, 4,

In [None]:
promoters_percent= np.round(len(promoters)/len(data) *100,2)
promoters_percent

52.19

In [None]:
detractors_percent =np.round(len(detractors)/len(data) *100 ,2)
detractors_percent

28.45

In [None]:
NPS=promoters_percent - detractors_percent
print(f'Nps score is {NPS}')

Nps score is 23.74
