## **Calculating probabilities**

In [45]:
import pandas as pd
import numpy as np

In [46]:
amir=pd.read_csv('/content/drive/MyDrive/DataCamp/Data/amir_deals.csv',index_col=0)
amir.head()

Unnamed: 0,product,client,status,amount,num_users
1,Product F,Current,Won,7389.52,19
2,Product C,New,Won,4493.01,43
3,Product B,New,Won,5738.09,87
4,Product I,Current,Won,2591.24,83
5,Product E,Current,Won,6622.97,17


## **The `df.shape` attribute**


---



In [47]:
print('row_col:',amir.shape) #row & column
print('row:',amir.shape[0]) #only row
print('col:',amir.shape[1]) #only column

row_col: (178, 5)
row: 178
col: 5


1) Count the number of deals Amir worked on for each **product** type and store it as **counts**.

2) Calculate the probability of selecting a deal for the different **product** types by dividing the **counts** by the total number of deals Amir worked on. Save this as **probs**.

## **If we want to convert a series into a DataFrame, we can use `.to_frame()`**


---


In [48]:
#1 Count the number of deals for each product type
counts=amir['product'].value_counts()
counts.to_frame()

Unnamed: 0,product
Product B,62
Product D,40
Product A,23
Product C,15
Product F,11
Product H,8
Product I,7
Product E,5
Product N,3
Product G,2
Product J,2


In [49]:
#2 Divide the counts by the total number of deals, df.shape[0]
# The result is the desired probability for each product
probs=counts/amir.shape[0]
probs

Product B    0.348315
Product D    0.224719
Product A    0.129213
Product C    0.084270
Product F    0.061798
Product H    0.044944
Product I    0.039326
Product E    0.028090
Product N    0.016854
Product G    0.011236
Product J    0.011236
Name: product, dtype: float64

So, these are the probabilities of randomly selecting a deal for each product type.
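As a possible shortcut, `value_counts(normalize=True)` returns the relative frequencies in one step. The sketch below uses a small made-up Series (`products` is a hypothetical stand-in for `amir['product']`, not the real data) and checks that both approaches agree.

```python
import pandas as pd

# Hypothetical stand-in for amir['product']; the values are made up.
products = pd.Series(["B", "B", "D", "A", "B", "D", "A", "C"], name="product")

# Two-step version: counts divided by the number of rows.
counts = products.value_counts()
probs_manual = counts / len(products)

# One-step version: normalize=True returns relative frequencies directly.
probs_direct = products.value_counts(normalize=True)

# Compare the two results label by label; the largest gap should be zero.
max_gap = (probs_manual.sort_index() - probs_direct.sort_index()).abs().max()
print(probs_direct)
print("largest difference:", max_gap)
```

Either way, the probabilities should sum to 1, which is a quick sanity check on the result.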

## **Sampling deals with/without replacement**


---



In the previous exercise, you counted the deals Amir worked on. Now it's time to randomly pick 10 deals so that you can reach out to each customer and ask if they were satisfied with the service they received. You'll try doing this both with and without replacement.

Additionally, you want to make sure this is done randomly and that it can be reproduced in case you get asked how you chose the deals, so you'll need to set the random seed before sampling from the deals.

1) Set the random seed to 50.
Take a sample of 10 deals without replacement and store them as sample_without_replacement.

2) Take a sample of 10 deals with replacement and save as sample_with_replacement.

3) What type of sampling is better to use for this situation?

In [50]:
#1 Set the seed, then sample 10 deals without replacement (the default)
np.random.seed(50)
without=amir.sample(10)
without

Unnamed: 0,product,client,status,amount,num_users
118,Product D,Current,Lost,3416.82,12
91,Product D,New,Lost,5744.26,17
81,Product B,Current,Won,5430.55,38
142,Product B,Current,Won,4795.64,8
144,Product A,Current,Won,5736.79,72
136,Product D,Current,Won,4814.25,67
123,Product B,Current,Won,7219.99,16
119,Product D,Current,Lost,5592.66,85
25,Product B,New,Lost,3124.62,31
168,Product B,Current,Won,4965.08,9


In [51]:
#2 Reset the seed, then sample 10 deals with replacement
np.random.seed(50)
sam_with=amir.sample(10,replace=True)
sam_with

Unnamed: 0,product,client,status,amount,num_users
177,Product A,Current,Won,6448.07,34
140,Product B,Current,Won,4566.61,9
110,Product H,Current,Won,5257.16,22
34,Product E,New,Lost,3706.78,3
133,Product B,New,Lost,4429.76,90
71,Product I,Current,Won,2588.6,51
71,Product I,Current,Won,2588.6,51
23,Product F,Current,Won,5226.04,82
134,Product D,Current,Won,5992.86,98
3,Product B,New,Won,5738.09,87


If we sample with replacement, we might end up calling the same customer twice (deal 71 appears twice in the sample above).
So, for this situation, sampling without replacement is the better choice, because every selected deal is distinct.

Sampling with replacement is appropriate when each draw should be independent, so that every deal has the same probability of being selected on every draw.
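The difference between the two sampling modes can be checked directly. This is a minimal sketch on a made-up table (`deals` is hypothetical, not the real `amir_deals.csv`); it passes `random_state` to `sample()` instead of calling `np.random.seed()`, which is an alternative way to make a single draw reproducible.

```python
import pandas as pd

# Hypothetical stand-in for the deals table; the values are made up.
deals = pd.DataFrame(
    {"product": list("ABCDEFGH"), "amount": range(100, 900, 100)},
    index=range(1, 9),
)

# random_state seeds this one call, independent of the global NumPy seed.
without_repl = deals.sample(5, random_state=50)
with_repl = deals.sample(5, replace=True, random_state=50)

# Without replacement, every row label appears at most once.
print("unique without replacement:", without_repl.index.is_unique)

# With replacement, duplicates are possible but not guaranteed in a given draw.
print("any duplicates with replacement:", with_repl.index.duplicated().any())

# The same random_state reproduces the same sample.
print("reproducible:", without_repl.equals(deals.sample(5, random_state=50)))
```

Note that without replacement the sample size cannot exceed the number of rows, whereas with replacement it can.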

## **Creating a probability distribution**


---


A new restaurant opened a few months ago, and the restaurant's management wants to optimize its seating space based on the size of the groups that come most often. On one night, there are 10 groups of people waiting to be seated at the restaurant, but instead of being called in the order they arrived, they will be called randomly. 

In this exercise, you'll investigate the probability of groups of different sizes getting picked first. Data on each of the ten groups is contained in the restaurant_groups DataFrame.

**Remember that expected value can be calculated by multiplying each possible outcome with its corresponding probability and taking the sum**.

In [52]:
restaurant=pd.read_csv('/content/drive/MyDrive/DataCamp/Data/restaurant_groups.csv')
restaurant

Unnamed: 0,group_id,group_size
0,A,2
1,B,4
2,C,6
3,D,2
4,E,2
5,F,2
6,G,3
7,H,2
8,I,4
9,J,2




---


1) Count the number of each **group_size** in **restaurant_groups**, then divide by the number of rows in **restaurant_groups** to calculate the probability of randomly selecting a group of each size. Save as **size_prob**.

- Reset the index of **size_prob**.

- Rename the columns of **size_prob** to **group_size** and **prob** (the code in this notebook names the second column **probability**).


---


2) Calculate the expected value of the **size_prob**, which represents the expected group size, by multiplying the **group_size** by the **prob** and taking the sum.


---


3) Calculate the probability of randomly picking a group of 4 or more people by subsetting for groups of size 4 or more and summing the probabilities of selecting those groups.

In [53]:
#1 probability of selecting each group_size in restaurant
size_prob=restaurant['group_size'].value_counts()/restaurant.shape[0]
size_prob

2    0.6
4    0.2
6    0.1
3    0.1
Name: group_size, dtype: float64

In [54]:
# Reset index and rename columns
size_prob=size_prob.reset_index()
size_prob.columns=['group_size','probability']
size_prob

Unnamed: 0,group_size,probability
0,2,0.6
1,4,0.2
2,6,0.1
3,3,0.1


In [55]:
#2 Expected value
expt=np.sum(size_prob['group_size']*size_prob['probability'])
expt

2.9000000000000004

So, the expected group size is 2.9, or roughly 3 people.
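Written out term by term with the probabilities from the table above, the expected value is:

$$
E[X] = \sum_i x_i \, P(X = x_i) = 2(0.6) + 4(0.2) + 6(0.1) + 3(0.1) = 1.2 + 0.8 + 0.6 + 0.3 = 2.9
$$

The trailing `...0004` in the printed result is floating-point rounding, not part of the true value.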

In [56]:
#3 Probabilities for groups of 4 or more people
four=size_prob[size_prob['group_size']>=4]['probability']
four.to_frame()

Unnamed: 0,probability
1,0.2
2,0.1


In [57]:
prob_four=np.sum(four)
prob_four

0.30000000000000004

So, the probability of randomly picking a group of 4 or more people is 0.3, or 30%.
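The three steps of this exercise can also be sketched as one self-contained script. It assumes the same ten group sizes shown in the table above, typed in by hand rather than read from the CSV, and uses `rename_axis(...).reset_index(name=...)` to name both columns explicitly instead of assigning to `.columns`; that is one possible variation, not the only way to do it.

```python
import pandas as pd

# The ten group sizes from the table above, entered by hand.
restaurant_groups = pd.DataFrame(
    {
        "group_id": list("ABCDEFGHIJ"),
        "group_size": [2, 4, 6, 2, 2, 2, 3, 2, 4, 2],
    }
)

# Step 1: relative frequency of each group size as a two-column DataFrame.
size_prob = (
    restaurant_groups["group_size"]
    .value_counts(normalize=True)
    .rename_axis("group_size")
    .reset_index(name="prob")
)

# Step 2: expected group size = sum of (outcome * probability).
expected_value = (size_prob["group_size"] * size_prob["prob"]).sum()

# Step 3: P(group size >= 4) = sum of the probabilities of the qualifying rows.
prob_4_or_more = size_prob.loc[size_prob["group_size"] >= 4, "prob"].sum()

print(size_prob)
print("expected group size:", round(expected_value, 2))
print("P(size >= 4):", round(prob_4_or_more, 2))
```

Rounding before printing hides the floating-point noise seen in the raw outputs.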