# ICC Women's World Cup 2022 and World Cup 2023 Datasets

*Match-wise data of the ICC Women's World Cup 2022 and 2023*

Source: https://www.kaggle.com/datasets/aravindas01/icc-womens-world-cup2022-dataset?resource=download

### About Dataset

This dataset contains the latest match-wise data of the ICC Women's World Cup 2022 as of April 3, 2023. This dataset can be used to analyse the women's World Cup series held in New Zealand.

### Women World Cup 2022 Data Description

**Match_id** - Match Number<br>
**Team_1** - Name of playing team 1<br>
**Team_2** - Name of playing team 2<br>
**Venue** - Name of Venue<br>
**Stage** - Tournament stage<br>
**Toss_winner** - Which team won the toss<br>
**Toss_decision** - Decision chosen by toss winning captain<br>
**First_innings_score** - Score in the first innings<br>
**First_innings_wkts** - Wickets fallen in first innings<br>
**Second_innings_score** - Score in the second innings<br>

### Women World Cup 2023 Data Description

Match Number<br>
Stage<br>
Team 1<br>
Team 2<br>
winning team<br>
won by<br>

### Importing Libraries and loading the data

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Forward slashes avoid invalid escape sequences such as '\W' and work on every OS
df = pd.read_csv('Datasets/WomenWorldCup2022.csv')
df23 = pd.read_csv('Datasets/WomenWorldCup2023.csv')

In [5]:
df.head()

Unnamed: 0,Match_id,Team_1,Team_2,Venue,Stage,Toss_winner,Toss_decision,First_innings_score,First_innings_wkts,Second_innings_score,Second_innings_wkts,Winner,Player_of_the_match
0,1,New Zealand,West Indies,"Bay Oval, Mount Maunganui",Group,New Zealand,Field,259,9,256.0,10.0,West Indies,Hayley Matthews
1,2,Bangladesh,South Africa,"University Oval, Dunedin",Group,Bangladesh,Field,207,10,175.0,10.0,South Africa,Ayabonga Khaka
2,3,Australia,England,"Seddon Park, Hamilton",Group,England,Field,310,3,298.0,8.0,Australia,Rachael Haynes
3,4,Pakistan,India,"Bay Oval, Mount Maunganui",Group,India,Bat,244,7,137.0,10.0,India,Pooja Vastrakar
4,5,New Zealand,Bangladesh,"University Oval, Dunedin",Group,New Zealand,Field,140,8,144.0,1.0,New Zealand,Suzie Bates


## Brief primer on Descriptive statistics

### Central Tendencies

The central tendencies are values which represent the central or 'typical' value of the given distribution.

The three most popular estimates of central tendency are the mean, the median and the mode. Typically we report the mean for normal distributions and the median for skewed distributions.

A good rule of thumb is to use the mean when outliers do not noticeably affect its value and the median when they do.

In [6]:
#Calculating the mean and median on first innings
First = df['First_innings_score']
First.mean(), First.median()

(220.74193548387098, 234.0)

In [7]:
#Calculating the mean and median on second innings
Second = df['Second_innings_score']
Second.mean(), Second.median()

(187.33333333333334, 184.0)
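
To illustrate the rule of thumb, the following minimal sketch uses made-up scores (they are not taken from the dataset) and shows how a single extreme value moves the mean far more than the median.

```python
import numpy as np

# Hypothetical innings scores, invented for illustration only
scores = np.array([210, 220, 230, 240, 250])
scores_with_outlier = np.append(scores, 20)  # add one very low score

mean_before, median_before = scores.mean(), np.median(scores)
mean_after, median_after = scores_with_outlier.mean(), np.median(scores_with_outlier)

print(mean_before, median_before)  # 230.0 230.0
print(mean_after, median_after)    # 195.0 225.0
```

In the outputs above, the first innings mean (220.7) sits below the median (234.0), which hints that a few low scores are pulling the mean down.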

### Measures of Spread

Spread measures how far from the mean the values tend to fall. *Variance* and *Standard Deviation* are used to measure spread quantitatively.

They are dependent quantities, with the standard deviation being defined as the square root of variance.

In [8]:
#Calculating the variance and standard deviation of first innings
First.var(), First.std()

(5286.464516129033, 72.70807737885133)

In [9]:
#Calculating the variance and standard deviation of second innings
Second.var(), Second.std()

(3566.091954022989, 59.71676443029201)
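
As a quick check of the relationship between the two quantities, the sketch below uses the five first-innings scores shown in the `df.head()` output and NumPy rather than pandas. Note that pandas' `var()` and `std()` default to the sample estimate (`ddof=1`), whereas NumPy defaults to the population estimate (`ddof=0`), so `ddof=1` is passed explicitly here.

```python
import numpy as np

# First-innings scores of the five matches shown in df.head()
scores = np.array([259, 207, 310, 244, 140], dtype=float)

sample_var = scores.var(ddof=1)   # sample variance, as pandas computes it
sample_std = scores.std(ddof=1)   # sample standard deviation
population_var = scores.var()     # NumPy default, divides by n instead of n - 1

# The standard deviation is the square root of the variance
print(np.isclose(np.sqrt(sample_var), sample_std))
# The sample variance is always at least as large as the population variance
print(sample_var > population_var)
```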

The mean and the standard deviation are often the best quantities for summarizing distributions with symmetrical histograms and few outliers.

For such distributions the mean and standard deviation carry most of the information, and other measures such as the median add little.

In [10]:
#Histogram for first innings
import plotly.express as px 
px.histogram(First)                      

In [11]:
#Histogram for second innings
import plotly.express as px 
px.histogram(Second)

**Plotting First and Second innings on the same histogram**

In [12]:
import plotly.graph_objects as go

# Create histogram traces for each data
trace1 = go.Histogram(
    x= First,
    opacity=0.75,
    name='First innings'
    
)

trace2 = go.Histogram(
    x= Second,
    opacity=0.75,
    name='Second innings'
)

#Create Layout
layout = go.Layout(
    title='First and second innings',
    xaxis=dict(title='Value'),
    yaxis=dict(title='Frequency'),
    barmode='overlay'  # Overlay histograms
)

# Create figure object and add traces
fig = go.Figure(data=[trace1, trace2], layout=layout)

# Show the figure
fig.show()

### Binomial Distribution

A Bernoulli trial is an experiment with exactly two possible outcomes: *Success* or *Failure*. Success is denoted by 1 and failure by 0. Bernoulli trials have the same success (and failure) rate for every trial, and each trial is independent of every other trial.

The coin toss is a classic example of a Bernoulli trial. We simulate an experiment in which we flip a coin 100 times and count the number of heads. We then repeat this experiment a thousand times and plot the binomial distribution of the number of heads obtained in each experiment.

In [13]:
np.random.seed(42)

In [14]:
outcomes = [] 
for i in range(1000):
    heads = np.random.binomial(100, 0.5)  # 100 flips of a fair coin per experiment
    outcomes.append(heads)

px.histogram(outcomes)
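
The simulated counts can also be compared with the theoretical binomial mean `n*p` and standard deviation `sqrt(n*p*(1-p))`. The sketch below assumes 100 flips per experiment, as described in the text, and uses NumPy's newer `default_rng` generator with the `size` argument instead of a Python loop; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_flips, p_heads, n_experiments = 100, 0.5, 1000

# One binomial draw per experiment: the number of heads in 100 fair flips
heads = rng.binomial(n_flips, p_heads, size=n_experiments)

theoretical_mean = n_flips * p_heads                          # 50.0
theoretical_std = np.sqrt(n_flips * p_heads * (1 - p_heads))  # 5.0

# The simulated values should land close to the theoretical ones
print(heads.mean(), theoretical_mean)
print(heads.std(), theoretical_std)
```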

### Normal Distribution

As in the coin experiment above, let us sample 1000 points from a standard normal distribution and plot the number of occurrences as a histogram.

In [41]:
outcomes = []
for i in range(1000):
    sample = np.random.normal(0,1)
    outcomes.append(sample)
    
px.histogram(outcomes)

### Normal Tests

Height and weight data are approximately normally distributed. How can one test whether a distribution is normal?

**This is usually done in two ways:**
- **Histograms** if the distribution is shaped like a bell curve, we can be reasonably sure that it is normal.

- **Normal test** the scipy package provides a handy normaltest method. It returns a p-value: the probability of seeing data at least this far from normal if the underlying distribution really were normal.

In [16]:
#Checking for normality in the dataset:
stats.normaltest(First)

NormaltestResult(statistic=1.548268939471412, pvalue=0.461102706680516)

In [17]:
px.histogram(First)

As observed above, the histogram of **the First innings score** does not exactly resemble a bell curve. However, the normal test gives a p-value of 0.46, which is well above 0.05, so there is insufficient evidence to conclude that the distribution is not normal.

If the p-value is less than or equal to 0.05 reject the assertion and conclude that the sample could not have been selected from a normal distribution.<br>
If the p-value is greater than 0.05 then you have insufficient evidence to question the assertion and so you should treat the assertion as reasonable.

In [43]:
stats.normaltest(Second)

NormaltestResult(statistic=nan, pvalue=nan)
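
The `nan` result most likely arises because `Second_innings_score` contains missing values: the column is stored as float, and `normaltest` propagates NaN by default. This is an inference from the output, not something the notebook confirms. The sketch below reproduces the behaviour on synthetic data (not the real scores) and shows the `nan_policy='omit'` option.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for a score column with one missing entry
values = rng.normal(loc=190, scale=60, size=30)
values[3] = np.nan

# Default nan_policy='propagate': a single NaN makes the result NaN
with_nan = stats.normaltest(values)

# nan_policy='omit' drops missing values before running the test
without_nan = stats.normaltest(values, nan_policy='omit')

print(with_nan.pvalue, without_nan.pvalue)
```

On the real data, `stats.normaltest(Second, nan_policy='omit')` or `stats.normaltest(Second.dropna())` would be the corresponding calls, assuming missing values are indeed the cause.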

### Z-score and P-value

The z-score and p-value are central to almost every statistical inference tool and hypothesis testing method.

The **Z-score** measures how many standard deviations away from the mean a particular sample point is. <br>
The **P-value**, as the term is used in this section, is the probability that a normally distributed value lies within z standard deviations of the mean. In a sense, it measures the fraction of sample points whose z-score is no larger in magnitude than the given value of z.

In [44]:
#using Scipy package to create P-value function:
#fraction of a normal distribution lying within z standard deviations of the mean
def P_value(z):
    return 1 - 2 * (1 - stats.norm.cdf(z))

In [20]:
P_value(1), P_value(2), P_value(3)

(0.6826894921370859, 0.9544997361036416, 0.9973002039367398)

In [21]:
#using Scipy package to create Z-score function
def Z_score (frac):
    return stats.norm.ppf(0.5+frac/2)

In [22]:
Z_score(0.50), Z_score(0.68), Z_score(0.99)

(0.6744897501960817, 0.9944578832097535, 2.5758293035489004)
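
The two helper functions are inverses of each other, which the sketch below checks. It also computes the z-score of one observation directly from its definition. The helper names `coverage` and `z_for_coverage` are illustrative renamings of `P_value` and `Z_score`, and the mean and standard deviation are rounded from the first-innings outputs above.

```python
from scipy import stats

def coverage(z):
    # Fraction of a normal distribution within z standard deviations of the mean
    return 1 - 2 * (1 - stats.norm.cdf(z))

def z_for_coverage(frac):
    # z such that a fraction `frac` of the distribution lies within +/- z
    return stats.norm.ppf(0.5 + frac / 2)

# Round trip: converting z to a coverage and back recovers z
for z_value in (1.0, 2.0, 3.0):
    print(z_value, z_for_coverage(coverage(z_value)))

# z-score of a single observation: (value - mean) / standard deviation
score, mean, std = 310, 220.74, 72.71  # rounded first-innings statistics
z = (score - mean) / std
print(z)
```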

# Sampling

## Credit Card Fraud Dataset

*The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.*

Source: https://data.world/raghu543/credit-card-fraud-data


### About Dataset
It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used, for example, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.


In [23]:
data = pd.read_csv('Datasets/creditcard.csv')
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [24]:
data.shape

(284807, 31)

### Estimating a Population Proportion
In the following sections, we will try to estimate the fraction of transactions that are fraudulent by examining only 8% of the data.

In [49]:
data_sample = data.sample(frac = 0.08)

In [50]:
#Counting the fraudulent and non-fraudulent cases
data_sample['Class'].value_counts()

0    22745
1       40
Name: Class, dtype: int64

In [51]:
#Fraction of fraudulent transactions in a random sample generated via Pandas
p_fraudlent = len(data_sample[data_sample['Class'] == 1]) /len(data_sample)
p_fraudlent

0.0017555409260478386

Imagine that we didn't have the data for all the credit card transactions as we do now. How would we go about estimating the real fraction from the results of this small sample? As you may have guessed, it really isn't possible to determine the exact fraction with 100% accuracy. What we can do, though, is define a confidence interval and state quantitatively how confident we are that the real fraction lies within a particular range. In doing so, we shift from a deterministic realm to the stochastic realm of samples governed by probabilities.

The p_fraudlent that we obtained in a previous step is a random variable whose value will change in different trials of the experiment (sampling 8% of the population). Let's say that we conduct this experiment 1000 times. How will the p_fraudlent obtained in each experiment be related to each other? Let's simulate the experiment, plot the distribution and find out.

In [28]:
p_fraudlent_samples = []
for i in range(1000):
    sample = data.sample(frac =0.08)
    p_sample = len(sample[sample['Class'] == 1])/ len(sample)
    p_fraudlent_samples.append(p_sample)

In [29]:
px.histogram(p_fraudlent_samples)

As can be seen, the p's are approximately normally distributed. Without proof, we present the following results:

![Alt text](image-26.png)

As n approaches infinity, the distribution of the sample proportion becomes normal with p as its mean. The accuracy of our estimate therefore depends only on the spread (standard deviation) of that distribution.
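
The referenced image is not reproduced in this text. As an assumption about what it states, the standard large-sample result for a sample proportion, which is consistent with the formula used for `sigma_fraudlent` later in this section, can be written as follows, where $\hat{p}$ is the sample proportion, $p$ the true proportion and $n$ the sample size:

$$
\hat{p} \sim \mathcal{N}\left(p,\ \sigma_{\hat{p}}^{2}\right),
\qquad
\sigma_{\hat{p}} = \sqrt{\frac{p\,(1-p)}{n}}
$$

Since $p$ is unknown in practice, $\hat{p}$ is substituted for it, which gives the confidence interval

$$
\hat{p} \;\pm\; z \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
$$

with $z$ the z-score for the chosen confidence level.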

### Reporting the results

The sample size determines the accuracy of our results. As mentioned earlier, we can never be 100% accurate; we can only be confident to a certain level.

In [30]:
def Z_score(frac):
    return stats.norm.ppf(0.5 + frac/2)

In [31]:
z = Z_score(0.99)
z

2.5758293035489004

**Interpretation:**
99% of the values of a normal distribution fall within about 2.576 standard deviations of the mean.

In [32]:
#Displaying the sample proportion of fraudulent entries (it differs between runs because the sample is random)
p_fraudlent

0.0018433179723502304

In [52]:
#sigma_fraudlent is the standard deviation of the sample proportion: sqrt(p * (1 - p) / n)
sigma_fraudlent = np.sqrt((p_fraudlent * (1 -p_fraudlent))/len(data_sample))
sigma_fraudlent

0.00027733163808948795

In [34]:
lower_limit = p_fraudlent - z*sigma_fraudlent
upper_limit = p_fraudlent + z*sigma_fraudlent
lower_limit, upper_limit

(0.001111350046528809, 0.0025752858981716517)

The code above calculates the lower and upper limits of a confidence interval for the proportion of fraud. It uses the variables "p_fraudlent" (the sample proportion of fraud), "sigma_fraudlent" (its standard deviation), and "z" (the z-score for the desired confidence level; at a 99% confidence level, z is about 2.576).

**Interpretation**:
From the results above, we are 99% confident that the real p lies within approximately (0.0011, 0.0026). The exact limits change from run to run because the 8% sample is drawn at random.

**Presenting the results** <br>
- There is a tradeoff between confidence level and range size. The higher the confidence, the larger the range.<br>
- Increasing the sample size reduces the standard deviation and therefore gives more accurate and practically significant results.
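
Both points can be checked numerically. The sketch below wraps the interval calculation in a hypothetical helper, `proportion_ci`, and uses a sample proportion and sample size rounded from the outputs above; it is an illustration under those assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats

def proportion_ci(p_hat, n, confidence):
    """Normal-approximation confidence interval for a proportion."""
    z = stats.norm.ppf(0.5 + confidence / 2)
    sigma = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * sigma, p_hat + z * sigma

def ci_width(p_hat, n, confidence):
    low, high = proportion_ci(p_hat, n, confidence)
    return high - low

p_hat, n = 0.0017555, 22785  # rounded from the sample outputs above

# Higher confidence gives a wider interval
widths = [ci_width(p_hat, n, c) for c in (0.90, 0.95, 0.99)]
print(widths)

# Quadrupling the sample size halves the interval width
print(ci_width(p_hat, n, 0.99), ci_width(p_hat, 4 * n, 0.99))
```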

In [53]:
#testing whether the results tally with the interval above by computing the fraction of fraudulent cases in the whole dataset

p = len(data[data['Class'] == 1])/len(data)
p

0.001727485630620034

**Observation**: The value of p does indeed fall within the range suggested above.<br>


Finally, let us check the mean of the p's obtained from simulating the experiment 1000 times. This mean approaches p as the number of experiments grows, so the mean of "p_fraudlent_samples" (for example, np.mean(p_fraudlent_samples)) should be extremely close to the population value p, which is recomputed below as "expected_p" for comparison.

In [36]:
expected_p = len(data[data['Class'] == 1]) /len(data)
expected_p

0.001727485630620034
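
The comparison can be sketched without the CSV file. The example below swaps `data.sample` for hypergeometric draws, which model the number of frauds in a sample taken without replacement, and uses the counts quoted in the dataset description (492 frauds out of 284,807 transactions). The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

n_fraud, n_total = 492, 284_807    # counts quoted in the dataset description
n_sample = round(0.08 * n_total)   # size of an 8% sample

# Number of frauds in each of 1000 samples drawn without replacement
fraud_counts = rng.hypergeometric(n_fraud, n_total - n_fraud, n_sample, size=1000)
p_samples = fraud_counts / n_sample

p_true = n_fraud / n_total
# The mean of the simulated proportions should be very close to the true p
print(p_samples.mean(), p_true)
```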