# **First Innings Score Prediction**

Cricket is one of the most popular outdoor sports, and the Indian Premier League (IPL) is one of its best-known competitions, with a long and illustrious tradition in the sports world. The IPL is a professional Twenty20 (T20) league founded by the Board of Control for Cricket in India (BCCI) and first played in 2008. As a T20 league, each side bats for a maximum of 20 overs. Every year, eight teams from eight Indian cities participate in the league. A cricket match is influenced by a variety of factors, and this project describes the factors that have a major impact on the outcome of a T20 match. The IPL Score Prediction project takes several years of IPL data, including player, venue, team, and ball-by-ball information, and analyzes it to draw conclusions that can help improve players' results. It focuses on predicting the results of IPL matches using data mining techniques on both balanced and imbalanced datasets. In T20 matches, the first innings score is currently estimated from the current run rate, which is the number of runs scored divided by the number of overs bowled. The factors considered include:

1. Number of wickets left
2. Number of balls left
3. How many runs has the current batsman scored?
4. How many runs has the team scored in the last 5 overs?
5. How many wickets has the team lost in the last 5 overs?
6. The nature of the pitch
7. How strong are the batting and bowling teams?
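
The run-rate estimate mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the overs value uses the usual cricket notation also seen in the dataset (5.3 means 5 overs and 3 balls); the function names `overs_to_balls` and `projected_score` are hypothetical and do not come from the notebook.

```python
def overs_to_balls(overs: float) -> int:
    """Convert cricket overs notation (5.3 = 5 overs and 3 balls) to balls bowled."""
    whole_overs = int(overs)
    # The digit after the decimal point counts balls in the current over.
    extra_balls = round((overs - whole_overs) * 10)
    return whole_overs * 6 + extra_balls


def projected_score(runs: int, overs: float, total_overs: int = 20) -> float:
    """Project the final score by assuming the current run rate continues."""
    balls = overs_to_balls(overs)
    if balls == 0:
        raise ValueError("At least one ball must have been bowled.")
    run_rate = runs / (balls / 6)  # runs per over so far
    return run_rate * total_overs


# Example: 61 runs after 5.1 overs
print(round(projected_score(61, 5.1), 1))
```

A projection like this ignores wickets in hand and recent momentum, which is why the factors listed above are worth modelling.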

##### **STEPS INVOLVED**

1. First, the model is trained on the training portion of the data.
2. Roughly 15-20% of the collected data is held out to test the model.
3. For the prediction, we use a Linear Regression algorithm.
4. The project is split across separate Jupyter Notebooks: one to collect the IPL data, inspect it, and clean it; and a second to further refine the features, fit a Linear Regression model, and evaluate its output.

##### **SOFTWARE REQUIREMENT:**

1. Python
2. Pandas
3. NumPy
4. Matplotlib
5. scikit-learn
6. Jupyter Notebook/Google Colab
7. PyCharm

##### **FUNCTIONAL REQUIREMENT:**

1. The system must provide the predicted IPL score.
2. The system must have an easy-to-use interface for all users.
3. The admin must be able to modify/update the dataset.
4. The dataset of the IPL score must be available for the system.

##### **DATA COLLECTION:**

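The IPL score data was collected from Kaggle. We used 80% of the data for the training set and the remaining 20% for the test set. In the notebook code, the split is made by date: matches up to 2016 form the training set and matches from 2017 onward form the test set.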

The parameters are:
1. Venue
2. Bat Team
3. Bowl Team
4. Batsman
5. Bowler
6. Runs
7. Wickets
8. Overs
9. Runs last 5
10. Striker
11. Non Striker
12. Total

##### **STEPS INVOLVED IN DATA PROCESSING:**

1. Feature Selection: The data contains many attributes that the project does not use, so we keep only the attributes we need.
2. Normalization: The data collected from the internet is first filtered and then normalized. Normalization means rescaling real-valued numeric attributes into the range between 0 and 1.
3. Machine Learning: Training a model is the process of iteratively refining the prediction equation by looping over the dataset several times and updating the weight and bias values in the direction indicated by the slope (gradient) of the cost function. We consider training complete when the error falls to an acceptable level, or when further training iterations (epochs or cycles) fail to reduce the cost.
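
The normalization and training steps described above can be sketched as follows. This is a minimal illustration on made-up numbers, not the notebook's code: the helper names `min_max_scale` and `fit_by_gradient_descent`, the learning rate, and the stopping tolerance are all assumptions chosen for the example.

```python
import numpy as np


def min_max_scale(values):
    """Rescale numeric values into the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    lowest, highest = values.min(), values.max()
    if highest == lowest:
        return np.zeros_like(values)
    return (values - lowest) / (highest - lowest)


def fit_by_gradient_descent(x, y, learning_rate=0.1, epochs=5000, tolerance=1e-10):
    """Fit y ~ weight * x + bias by gradient descent on the mean squared error."""
    weight, bias = 0.0, 0.0
    previous_cost = float("inf")
    cost = previous_cost
    for _ in range(epochs):
        error = weight * x + bias - y
        cost = np.mean(error ** 2)
        # Stop when another pass no longer reduces the cost noticeably.
        if previous_cost - cost < tolerance:
            break
        previous_cost = cost
        # Step against the gradient of the cost with respect to each parameter.
        weight -= learning_rate * 2 * np.mean(error * x)
        bias -= learning_rate * 2 * np.mean(error)
    return weight, bias, cost


# Made-up toy data: overs bowled and runs scored so far.
overs = [5, 8, 11, 14, 17, 20]
runs = [40, 66, 92, 118, 144, 170]

x = min_max_scale(overs)
y = np.asarray(runs, dtype=float)
weight, bias, cost = fit_by_gradient_descent(x, y)
print(weight, bias, cost)
```

Scaling the input to [0, 1] keeps the gradient steps well behaved, so a single fixed learning rate is enough for this toy case.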

##### **IMPLEMENTATION OF ALGORITHM:**

Linear Regression is the algorithm used in our project.
1. Linear Regression: Regression measures the average relationship between two or more continuous variables, expressed in terms of a response variable and feature variables. In other words, regression analysis determines the nature of the relationship between the variables so that the most likely value of the dependent variable can be predicted for given values of the independent variables.
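
For reference, the relationship described above can be written in the usual textbook form. The symbols below are notation introduced here for illustration and do not appear in the notebook.

$$
\hat{y} = w_0 + \sum_{j=1}^{p} w_j x_j
$$

Here $\hat{y}$ is the predicted first innings total (the dependent variable), $x_1, \dots, x_p$ are the input features (the independent variables), $w_0$ is the intercept, and $w_1, \dots, w_p$ are the learned weights. Ordinary least squares, which scikit-learn's `LinearRegression` uses, chooses the weights that minimize the sum of squared errors over the $n$ training rows:

$$
\min_{w_0, \dots, w_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$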

##### **RESULT:**

The IPL score prediction system works properly. All of the attribute values were preprocessed correctly. Once preprocessing was complete, the model was trained on the training data. The accuracy of the Linear Regression model was found to be 82%. The GUI for IPL score prediction was built with HyperText Markup Language (HTML). The coding was done in Jupyter Notebook and VS Code. After completing all of these steps, we linked the front end (HTML) with the back end (Python).

##### **CONCLUSION:**

Linear Regression is applied to the IPL dataset, which is essential for improving players' future performance. Using selected input variables from the Kaggle dataset, we created a model to forecast the IPL score. Without such a model, it is difficult to organize the available IPL data and draw useful conclusions from it. This model was therefore created to predict the IPL score with high precision, taking into account the factors that influence it.

##### Sports Datasets:
1. https://data.world/datasets/sports
2. https://www.kaggle.com/karangadiya/fifa19

In [30]:
# Importing essential libraries
import pandas as pd
import pickle

# Loading the dataset
df = pd.read_csv('/content/ipl.csv')

In [31]:
df.head() #view the head or first five rows of dataset

Unnamed: 0,mid,date,venue,bat_team,bowl_team,batsman,bowler,runs,wickets,overs,runs_last_5,wickets_last_5,striker,non-striker,total
0,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,SC Ganguly,P Kumar,1,0,0.1,1,0,0,0,222
1,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,1,0,0.2,1,0,0,0,222
2,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.2,2,0,0,0,222
3,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.3,2,0,0,0,222
4,1,2008-04-18,M Chinnaswamy Stadium,Kolkata Knight Riders,Royal Challengers Bangalore,BB McCullum,P Kumar,2,0,0.4,2,0,0,0,222


In [32]:
# --- Data Cleaning ---
# Removing unwanted columns
columns_to_remove = ['mid', 'venue', 'batsman', 'bowler', 'striker', 'non-striker']
df.drop(labels=columns_to_remove, axis=1, inplace=True)

In [33]:
df.head() #view head after removing unwanted columns

Unnamed: 0,date,bat_team,bowl_team,runs,wickets,overs,runs_last_5,wickets_last_5,total
0,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.1,1,0,222
1,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,1,0,0.2,1,0,222
2,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.2,2,0,222
3,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.3,2,0,222
4,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,2,0,0.4,2,0,222


In [34]:
df['bat_team'].unique()

array(['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals',
       'Mumbai Indians', 'Deccan Chargers', 'Kings XI Punjab',
       'Royal Challengers Bangalore', 'Delhi Daredevils',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Sunrisers Hyderabad',
       'Rising Pune Supergiants', 'Gujarat Lions',
       'Rising Pune Supergiant'], dtype=object)

In [35]:
# Keeping only consistent teams
consistent_teams = ['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals',
                    'Mumbai Indians', 'Kings XI Punjab', 'Royal Challengers Bangalore', 
                    'Delhi Daredevils', 'Sunrisers Hyderabad']

In [36]:
df = df[(df['bat_team'].isin(consistent_teams)) & (df['bowl_team'].isin(consistent_teams))]

In [37]:
# Removing the first 5 overs of every match: a prediction needs at least 5 overs of data
df = df[df['overs']>=5.0]

In [38]:
df.head()

Unnamed: 0,date,bat_team,bowl_team,runs,wickets,overs,runs_last_5,wickets_last_5,total
32,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,61,0,5.1,59,0,222
33,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,61,1,5.2,59,1,222
34,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,61,1,5.3,59,1,222
35,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,61,1,5.4,59,1,222
36,2008-04-18,Kolkata Knight Riders,Royal Challengers Bangalore,61,1,5.5,58,1,222


In [39]:
print(df['bat_team'].unique())
print(df['bowl_team'].unique())

['Kolkata Knight Riders' 'Chennai Super Kings' 'Rajasthan Royals'
 'Mumbai Indians' 'Kings XI Punjab' 'Royal Challengers Bangalore'
 'Delhi Daredevils' 'Sunrisers Hyderabad']
['Royal Challengers Bangalore' 'Kings XI Punjab' 'Delhi Daredevils'
 'Rajasthan Royals' 'Mumbai Indians' 'Chennai Super Kings'
 'Kolkata Knight Riders' 'Sunrisers Hyderabad']


In [40]:
df.dtypes

date               object
bat_team           object
bowl_team          object
runs                int64
wickets             int64
overs             float64
runs_last_5         int64
wickets_last_5      int64
total               int64
dtype: object

In [41]:
# Converting the 'date' column from string to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

In [42]:
df.dtypes

date              datetime64[ns]
bat_team                  object
bowl_team                 object
runs                       int64
wickets                    int64
overs                    float64
runs_last_5                int64
wickets_last_5             int64
total                      int64
dtype: object

In [43]:
# --- Data Preprocessing ---
# Converting categorical features using get_dummies method
encoded_df = pd.get_dummies(data=df, columns=['bat_team', 'bowl_team'])

In [44]:
encoded_df.head()

Unnamed: 0,date,runs,wickets,overs,runs_last_5,wickets_last_5,total,bat_team_Chennai Super Kings,bat_team_Delhi Daredevils,bat_team_Kings XI Punjab,...,bat_team_Royal Challengers Bangalore,bat_team_Sunrisers Hyderabad,bowl_team_Chennai Super Kings,bowl_team_Delhi Daredevils,bowl_team_Kings XI Punjab,bowl_team_Kolkata Knight Riders,bowl_team_Mumbai Indians,bowl_team_Rajasthan Royals,bowl_team_Royal Challengers Bangalore,bowl_team_Sunrisers Hyderabad
32,2008-04-18,61,0,5.1,59,0,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
33,2008-04-18,61,1,5.2,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
34,2008-04-18,61,1,5.3,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
35,2008-04-18,61,1,5.4,59,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0
36,2008-04-18,61,1,5.5,58,1,222,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [45]:
encoded_df.columns

Index(['date', 'runs', 'wickets', 'overs', 'runs_last_5', 'wickets_last_5',
       'total', 'bat_team_Chennai Super Kings', 'bat_team_Delhi Daredevils',
       'bat_team_Kings XI Punjab', 'bat_team_Kolkata Knight Riders',
       'bat_team_Mumbai Indians', 'bat_team_Rajasthan Royals',
       'bat_team_Royal Challengers Bangalore', 'bat_team_Sunrisers Hyderabad',
       'bowl_team_Chennai Super Kings', 'bowl_team_Delhi Daredevils',
       'bowl_team_Kings XI Punjab', 'bowl_team_Kolkata Knight Riders',
       'bowl_team_Mumbai Indians', 'bowl_team_Rajasthan Royals',
       'bowl_team_Royal Challengers Bangalore',
       'bowl_team_Sunrisers Hyderabad'],
      dtype='object')

In [46]:
len(encoded_df.columns)

23

In [47]:
# Rearranging the columns
encoded_df = encoded_df[['date', 'bat_team_Chennai Super Kings', 'bat_team_Delhi Daredevils',
                        'bat_team_Kings XI Punjab', 'bat_team_Kolkata Knight Riders',
                        'bat_team_Mumbai Indians', 'bat_team_Rajasthan Royals',
                        'bat_team_Royal Challengers Bangalore', 'bat_team_Sunrisers Hyderabad',
                        'bowl_team_Chennai Super Kings', 'bowl_team_Delhi Daredevils',
                        'bowl_team_Kings XI Punjab', 'bowl_team_Kolkata Knight Riders',
                        'bowl_team_Mumbai Indians', 'bowl_team_Rajasthan Royals',
                        'bowl_team_Royal Challengers Bangalore',
                        'bowl_team_Sunrisers Hyderabad',
                        'overs', 'runs', 'wickets', 'runs_last_5', 'wickets_last_5', 'total']]

In [48]:
len(encoded_df.columns)

23

In [49]:
# Splitting the data into train and test set
x_train = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year <= 2016]
x_test = encoded_df.drop(labels='total', axis=1)[encoded_df['date'].dt.year >=2017]

In [50]:
y_train = encoded_df[encoded_df['date'].dt.year <= 2016]['total'].values
y_test = encoded_df[encoded_df['date'].dt.year >= 2017]['total'].values

In [51]:
# Removing the 'date' column (axis=1 drops a column)
x_train.drop(labels='date', axis=1, inplace=True)
x_test.drop(labels='date', axis=1, inplace=True)
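
With the date column removed, each training row holds 21 values in the column order set above: eight batting-team indicators, eight bowling-team indicators, then overs, runs, wickets, runs_last_5 and wickets_last_5. A new match situation has to be laid out the same way before it can be passed to a model. The helper below is a hypothetical sketch of that step (the name `build_feature_row` is not part of the notebook).

```python
# Team order matches the one-hot columns produced above (alphabetical).
TEAMS = [
    'Chennai Super Kings', 'Delhi Daredevils', 'Kings XI Punjab',
    'Kolkata Knight Riders', 'Mumbai Indians', 'Rajasthan Royals',
    'Royal Challengers Bangalore', 'Sunrisers Hyderabad',
]


def build_feature_row(bat_team, bowl_team, overs, runs, wickets,
                      runs_last_5, wickets_last_5):
    """Return one input row in the same column order as the training data."""
    if bat_team not in TEAMS or bowl_team not in TEAMS:
        raise ValueError("Unknown team name.")
    bat_flags = [1 if team == bat_team else 0 for team in TEAMS]
    bowl_flags = [1 if team == bowl_team else 0 for team in TEAMS]
    return bat_flags + bowl_flags + [overs, runs, wickets, runs_last_5, wickets_last_5]


# Same situation as the first row shown after filtering (5.1 overs, 61 runs).
row = build_feature_row('Kolkata Knight Riders', 'Royal Challengers Bangalore',
                        overs=5.1, runs=61, wickets=0,
                        runs_last_5=59, wickets_last_5=0)
print(len(row))
```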

In [53]:
# --- Model Building ---
# Linear Regression Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression()
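
The cells here do not show how the reported accuracy was computed. One common way to score a regression model on held-out data is the R² score together with the mean absolute error, and the sketch below assumes that choice. It uses synthetic stand-in arrays so that it runs on its own; in the notebook, the same metric calls would take `regressor`, `x_test` and `y_test` from the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(200, 3))
targets = 150 + features @ np.array([30.0, -20.0, 10.0]) + rng.normal(0, 1.0, size=200)

# Hold out the last 20% of rows for testing.
x_train_demo, x_test_demo = features[:160], features[160:]
y_train_demo, y_test_demo = targets[:160], targets[160:]

demo_regressor = LinearRegression().fit(x_train_demo, y_train_demo)
predictions = demo_regressor.predict(x_test_demo)

mae = mean_absolute_error(y_test_demo, predictions)  # average error in runs
r2 = r2_score(y_test_demo, predictions)              # share of variance explained
print(f"MAE: {mae:.2f}  R2: {r2:.3f}")
```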

In [54]:
# Creating a pickle file for the regressor
filename = 'lr-model.pkl'
with open(filename, 'wb') as model_file:
    pickle.dump(regressor, model_file)
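
Once the model is saved, the back end can load it again and make predictions. The following is a minimal round-trip sketch, assuming a small stand-in model and a temporary directory so it runs on its own; in the project, the file would be the `lr-model.pkl` written above and the input row would have the same 21 columns as the training data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model with made-up data (overs, runs -> final total).
x_demo = np.array([[5.0, 40], [10.0, 80], [15.0, 120], [18.0, 150]])
y_demo = np.array([160, 165, 170, 175])
demo_model = LinearRegression().fit(x_demo, y_demo)

# Save the model; the with-statement closes the file even if dumping fails.
model_path = os.path.join(tempfile.mkdtemp(), 'lr-model.pkl')
with open(model_path, 'wb') as model_file:
    pickle.dump(demo_model, model_file)

# Load it back, as a serving script would, and predict for a new situation.
with open(model_path, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

sample = np.array([[12.0, 95]])
print(loaded_model.predict(sample))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.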