### Predictive Model for Estimating the Reach of a Training Institute on Social Media
* **Business Objective:**
The goal is to develop a predictive model that estimates the reach of a training institute’s social media campaigns. 
This will help in understanding how different social media activities impact the visibility and engagement of posts.

* **Client Ref:** For any Social Media Platform
* Features that can affect Reach:
    - Social media info (Likes, Shares, Followers, Type of Post, etc.)
* **Outcome:**
* The predictive model will analyze historical data from social media platforms and estimate the expected reach of future posts using likes, shares, comments and followers.
* This will allow the training institute to optimize its campaigns for better engagement and visibility.

<img src="K.jpg" width=600 height=300>

### TOC <a class="anchor" id="menu"></a>


* [0. Data Collection](#dc)
* [1. Data Validation & Basic Cleaning](#dv)

    @ Insights

* [2. Data Understanding (EDA)](#eda)
* [3. Missing Values & Outliers Handling](#naout)

    @ Predictive Modeling
* [4. Predictive Modeling (Machine Learning)](#pm)
    * [4.1 X & y](#xy)
    * [4.2 Feature Engineering](#fe)
    * [4.3 Train-Test Split](#tt)
    * [4.4 Model Selection & Training](#model)
    * [4.5 & 4.6 Test Predictions & Evaluation](#eval)
    * [4.7 Selecting Better Performance Model](#best)
    * [4.8 Hyperparameter Tuning For Best Model (if required)](#hyp)
    * [4.9 Saving Better Performance Model](#dep)
    * [4.10 Real Time Prediction](#pred)

### 0. Data <a id=dc>
    
[Back to Top](#menu)

* **For this project, we used a manually collected dataset**

    

In [None]:
# Base Python Libraries - Data Manipulation

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Loading data using pandas, with variable/object name 'rawdata'

rawdata=pd.read_excel("smfinal.xlsx")

In [None]:
rawdata

### column info:

| Column Name         | Description |
|---------------------|-------------|
| **Institute Name**  | Name of the institution offering the course. |
| **CourseName**      | Name of the course being analyzed. |
| **Platform**       | The platform where the course is being promoted (e.g., social media, websites). |
| **Followers**       | The number of followers of the institution or course page. |
| **type**           | The type/category of the course or post. |
| **Likes**          | Number of likes received on the post. |
| **Comments**       | Number of comments on the post. |
| **Share**          | Number of times the post was shared. |
| **Date**           | The date when the post was made. |
| **Location**       | The location associated with the institute or course. |
| **Interaction score** | A calculated score based on user interactions (likes, comments, shares, etc.). |
| **Reach**          | Reach class of the post (low, moderate, high), assigned based on the number of followers, type of post and number of shares. |

* **Basic Checks of Dataset**

In [None]:
rawdata.info()

In [None]:
print("First five rows of Dataset:")
display(rawdata.head())

print("Last five rows of Dataset:")
rawdata.tail()

In [None]:
print("Column data Check (Null Values & Data Types):")
print()
rawdata.info()

### 1. Data Validation & Cleaning <a id=dv>
   
[Back to Top](#menu)

In [None]:
# Taking copy of data
raw = rawdata.copy()

* **Validating Each Column Data**

In [None]:
raw.columns

In [None]:
def colcheck(df , col):
    print("column: ", col)
    print()
    print(f"Number Of Unique Values In Column:{df[col].nunique()}")
    print()
    print("unique values in column:")
    print()
    print(df[col].unique())
    print()
    print("data type of column:" , df[col].dtype)
    print()

                                        Institute Name

In [None]:
colcheck(raw,'Institute Name')

In [None]:
# Garbage entries become NaN; recognisable names are corrected.
# Assigning the result back avoids chained in-place modification.
raw['Institute Name'] = raw['Institute Name'].replace({
    '"@###"': np.nan, 'institute@@': np.nan, 'nameof': np.nan, 'Naaa': np.nan,
    'Nhu': np.nan, ' Labsji': np.nan, 'new': np.nan, 'insti': np.nan,
    'F#re': np.nan, 'Institutename': np.nan, 'Datafs': np.nan,
    'Vcube )8*6': 'Vcube Software Solutions',
    'Version iTt6yghb&&': 'Version iT',
    'Version iT***': 'Version iT',
})



In [None]:
colcheck(raw,'Institute Name')

In [None]:
raw['Institute Name']=raw['Institute Name'].str.strip(" ")
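
The cleaning above lists every bad value by hand. As an alternative, here is a minimal sketch of a whitelist approach: known corrections are applied first, and anything that is still not a recognised name becomes missing. The helper name `clean_categories` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def clean_categories(series, valid, fixes=None):
    """Strip whitespace, apply known corrections, blank out values not in `valid`."""
    cleaned = series.str.strip()
    if fixes:
        cleaned = cleaned.replace(fixes)
    # Values outside the whitelist become missing instead of being listed one by one
    return cleaned.where(cleaned.isin(valid))

demo = pd.Series([" Version iT", "Version iT***", "Nhu", "Vcube Software Solutions", None])
fixed = clean_categories(
    demo,
    valid=["Version iT", "Vcube Software Solutions"],
    fixes={"Version iT***": "Version iT"},
)
print(fixed.tolist())
```

The trade-off is that a whitelist has to be maintained, but new kinds of garbage are caught without editing the cleaning cell.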

                                        Course name

In [None]:
colcheck(raw,'CourseName')

In [None]:
# Unrecognisable course names become NaN
raw['CourseName'] = raw['CourseName'].replace(
    ['Dats', 'Dlp', 'Datam', 'Dau', 'AIhy', 'Dataj', 'AWji',
     'AWSmmm', 'AWS^^', 'AW66', 'AW##', 'AW%%', 'dT'],
    np.nan,
)


In [None]:
colcheck(raw,'CourseName')

In [None]:
raw['CourseName']=raw['CourseName'].str.strip(" ")

                                            Platform

In [None]:
colcheck(raw,'Platform')

In [None]:
# Unrecognisable platform names become NaN
raw['Platform'] = raw['Platform'].replace(
    ['rt', 'i', 'instagy', 'In', 'insta^^', 'kol', 'instl', 'insoooa', 'profile'],
    np.nan,
)

In [None]:
raw['Platform']=raw['Platform'].str.strip(' ')
raw['Platform'] = raw['Platform'].replace('insta', 'Insta')

In [None]:
colcheck(raw,'Platform')

                                                    Followers

In [None]:
colcheck(raw,'Followers')

In [None]:
# Keep the leading number of each corrupted entry
raw['Followers'] = raw['Followers'].replace({
    '4#$': 4, '12ed': 12, '543op': 543, '64%%': 64,
    '561--': 561, '111&': 111, '33p1': 33,
})

In [None]:
colcheck(raw,'Followers')
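
The replacements above keep the leading digits of each corrupted entry. A minimal sketch of doing the same thing with a regular expression, so new corrupted values do not need to be listed one by one. The helper name `leading_number` and the sample data are assumptions made for illustration.

```python
import pandas as pd

def leading_number(series):
    """Keep the leading run of digits in each value; anything else becomes NaN."""
    digits = series.astype(str).str.extract(r"^\s*(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce")

demo = pd.Series(["4#$", "12ed", 543, "33p1", "abc", None])
followers = leading_number(demo)
print(followers.tolist())
```

This also leaves the column with a numeric dtype, which the later median imputation and scaling steps rely on.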

                                            Type

In [None]:
colcheck(raw,'type')

In [None]:
# Unrecognisable post types become NaN
raw['type'] = raw['type'].replace(['po&&', 'rll', 'RE%%', 'POSi', 'eeio'], np.nan)

In [None]:
colcheck(raw,'type')

In [None]:
raw.columns

                                                Likes

In [None]:
colcheck(raw,'Likes')

In [None]:
raw['Likes'] = raw['Likes'].replace('48lp', np.nan)

In [None]:
colcheck(raw,'Likes')

                                        Comments

In [None]:
colcheck(raw,'Comments')

In [None]:
raw['Comments'] = raw['Comments'].replace('0j', np.nan)

In [None]:
colcheck(raw,'Comments')

                                                    Shares

In [None]:
raw.columns=raw.columns.str.strip(" ")

In [None]:
raw.columns

In [None]:
colcheck(raw,'Share')

                                        Date

In [None]:
colcheck(raw,'Date')

In [None]:
# Malformed dates become NaN
raw['Date'] = raw['Date'].replace(
    ['29-01-200', '04-01---', '18-12-=', '10&&&', '09-10op', '07-02-20ui',
     '04-01-2tgn', '28*88', '31-01-202509', '&&', '10-09-2((', '28/10/20==',
     '22/1ll', '25/100024'],
    np.nan,
)

In [None]:
# Dates in this column are written day-first (dd-mm-yyyy), so parse them that way
raw['Day']=pd.to_datetime(raw['Date'], dayfirst=True).dt.day_name()

In [None]:
del raw['Date']
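
A minimal sketch of how the weekday extraction could tolerate malformed dates without listing each one, assuming the dates are stored as `dd-mm-yyyy` strings (an assumption about the raw file, which is not shown here). The sample values are illustrative.

```python
import pandas as pd

demo = pd.Series(["29-01-2025", "04-01-2025", "10&&&", None])

# errors="coerce" turns unparseable strings into NaT instead of raising;
# the explicit day-first format avoids reading 04-01 as April 1st.
parsed = pd.to_datetime(demo, format="%d-%m-%Y", errors="coerce")
day = parsed.dt.day_name().str.lower()
print(day.tolist())
```

Rows that fail to parse end up as missing values in `day` and can be handled together with the other missing values later.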

In [None]:
raw.columns

                                                    Location

In [None]:
colcheck(raw,'Location')

                                                Reach

In [None]:
colcheck(raw,'Reach')

In [None]:
raw.info()

* All column data is now valid and the data types are appropriate

In [None]:
raw.head(2)

In [None]:
raw.shape

In [None]:
for col in raw.columns:
    if raw[col].dtype==object:
        raw[col] = raw[col].str.lower()

In [None]:
raw[raw.duplicated()]

In [None]:
raw =  raw.drop_duplicates().reset_index(drop=True)

In [None]:
raw[raw.duplicated()]

In [None]:
data=raw.copy()

### 2. EDA (Data Understanding)<a id=eda>
    
[Back to Top](#menu)

* In EDA we can analyze data in three ways
    - Uni-Variate Analysis (Study of individual column data)
        - Descriptive + Visual Analysis
    - Bi-Variate Analysis (Study of data between two columns)
        - Descriptive + Visual Analysis
    - Multi-Variate Analysis (Study of data between three or more columns)
        - Descriptive Stats

### 2.1 Uni-Variate Analysis

In [None]:
# Viz Libraries

import matplotlib.pyplot as plt
import seaborn as sns

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

* Defining helper functions for EDA

In [None]:
def descriptive_stats(df):
    print("Statistical Summary of Numerical Columns:\n")
    print(df.describe())
    
    print("\nValue Counts for Categorical Columns:\n")
    for col in df.select_dtypes(include=['object']).columns:
        print(f"Column: {col}")
        print(df[col].value_counts())
        print("\n")
descriptive_stats(data)

In [None]:
def univariate_analysis(df):
    for col in df.columns:
        if df[col].dtype == 'object':  # Categorical columns
            value_counts = df[col].value_counts()
            if len(value_counts) < 5:  # Pie chart for columns with few categories
                plt.figure(figsize=(6, 6))
                value_counts.plot.pie(autopct='%1.1f%%')
                plt.title(f'Pie Chart of {col}')
                plt.ylabel('')
            else:  # Count plot for columns with many categories
                plt.figure(figsize=(8, 4))
                sns.countplot(x=df[col], order=value_counts.index)
                plt.xticks(rotation=45)
                plt.title(f'Count Plot of {col}')
            plt.show()
        else:  # Numerical columns
            plt.figure(figsize=(8, 4))
            sns.histplot(df[col], kde=True, bins=30)
            plt.title(f'Distribution of {col}')
            plt.show()
univariate_analysis(data)

**Uni-Variate Insights**
* The dataset covers 21 institutes; naresh i technologies has the most posts.
* There are 9 different courses in the data; the full stack development course has the most posts or reels.
* The data is collected from Instagram.
* The average number of followers is 23126.
* Most of the content is uploaded as a post.
* The average number of likes is 52.
* Comments range from a minimum of 0 to a maximum of 42.
* The average number of shares is 2.
* Most of the institutes are from hyderabad.
* Low reach is the most frequent reach class.
* More posts or reels are published on tuesday than on any other day.

**2.2 Bi/Multi-Variate Analysis**
- Descriptive Stats Measures used to study data between two or more columns.

**Bi/Multi-Variate Combo**|**Stats Measures**
----|-----------
**Numeric-Numeric-..**|**Correlation (-1 to +1)**
**Numeric-Categorical-..**|**Aggregation Functions (count, min, max, avg, sum)**
**Categorical-Categorical-...**|**Frequency Distribution Table (FDT)**
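
The three rows of the table can be sketched on a small toy frame as follows. The values are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Followers": [100, 200, 300, 400],
    "Likes": [10, 20, 30, 40],
    "type": ["post", "reel", "post", "post"],
    "Reach": ["low", "high", "low", "moderate"],
})

# Numeric vs numeric: Pearson correlation coefficient
corr = toy["Followers"].corr(toy["Likes"])

# Numeric vs categorical: aggregation functions per class
likes_by_type = toy.groupby("type")["Likes"].agg(["count", "min", "max", "mean", "sum"])

# Categorical vs categorical: frequency distribution table
fdt = pd.crosstab(toy["type"], toy["Reach"])

print(round(corr, 2))
print(likes_by_type)
print(fdt)
```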

In [None]:
data.head(2)

                                                      Pure numeric
                                        
- To understand the relation between numeric columns we can use the correlation coefficient from descriptive stats

In [None]:
# Considering followers and likes

round(data[['Followers','Likes']].corr(),2)

In [None]:
# trendline='ols' requires the statsmodels package
px.scatter(data, x='Followers', y='Likes', trendline='ols', trendline_color_override='black', width=600, height=350)

# each point is one row, positioned by the two columns on the x & y axes

Insights:
* The data shows there is a weak positive correlation between Followers and Likes

In [None]:
# Considering shares and likes

round(data[['Share','Likes']].corr(),2)

In [None]:
# trendline='ols' requires the statsmodels package
px.scatter(data, x='Likes', y='Share', trendline='ols', trendline_color_override='black', width=600, height=350)

# each point is one row, positioned by the two columns on the x & y axes

Insights:
* The data shows that there is weak positive correlation between Likes and Shares

                                                         Pure categorical

In [None]:
###### Checking Categorical Columns

data.select_dtypes("O")

In [None]:
# Using pandas df crosstab to get FDT between two columns
pd.crosstab(data['Institute Name'], data['Reach'], margins=True)

In [None]:
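# Reach is categorical, so count the posts per institute and reach class
px.histogram(data, x='Institute Name', color='Reach', barmode='group')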

Insights:
* More posts or reels have low reach.

In [None]:
# We can use crosstab function in pandas to get FDT (Frequency Distribution Table) of each class

print("CourseName vs Type:")

display(pd.crosstab(data['CourseName'], data['type'], margins=True))

Insights:
* There are more posts than reels

                                                             Mixed

In [None]:
# Taking pandas df groupby to get aggregations between above columns
round(data.groupby('Institute Name')['Likes'].sum().sort_values(ascending=False),2) # Considering sum

In [None]:
px.box(data, x="Institute Name", y="Likes")

Insights:
* Posts or reels of nxtwave disruptive technologies have the highest number of likes

In [None]:
# Taking pandas df groupby to get aggregations between above columns
round(data.groupby('CourseName')['Likes'].sum().sort_values(ascending=False),2) # Considering sum 

In [None]:
px.box(data, x="CourseName", y="Likes")

* The AWS, DevOps course has the highest number of likes

In [None]:
data.columns

In [None]:
# More than two columns

data[['Institute Name','CourseName','Likes','Share']]

In [None]:
round(data.groupby(['Institute Name','CourseName'])[['Likes', 'Share']].sum())

* **Insights::**

### Visualizations can be done along with Descriptive Stats for EDA
- Visualizations are graphical representation of data with descriptive stats
- Insights can be taken as same as descriptive stats

**In Python we can do visualization of data using below modules,**
- Matplotlib
- Pandas
- Seaborn
- Plotly

* **Uni-Variate Graphs**

In [None]:
data.head(2)

In [None]:
# Column Data: CourseName

classes = data['CourseName'].value_counts().index
vals = data['CourseName'].value_counts().values

plt.style.use("ggplot")
plt.figure(figsize=(8,8))
# explode needs one offset per class, so it is built from the number of classes
plt.pie(x = vals, labels = classes, autopct=lambda p:f'{p:.2f}%, ({p*sum(vals)/100 :.0f})', explode=[0.1]*len(classes))
plt.title("Course comparison")
plt.legend()
plt.show()

* **Insights:**
    * Most of the institutes offer Full stack development course

                                                                Bar Chart
    -> syntax: plt.bar(classes, vals)

In [None]:
# Column Data: Reach

classes = data['Reach'].value_counts().index
vals = data['Reach'].value_counts().values

plt.style.use("dark_background")
plt.figure(figsize=(15,8))
plt.bar(classes, vals)
plt.title("Comparison of Reach column")
plt.ylabel("Frequency")
plt.show()

* Most posts or reels have low reach.

                                                     Numeric Col Data

In [None]:
# Column Data: Share

plt.hist(data['Share'])
plt.title("Distribution of Shares in Social Media")
plt.show()

* **Bi-Variate Graphs**

                                                    N-N

                                                 Scatter Plot
      -> syntax: plt.scatter(x, y)

In [None]:
# Column Data: Shares and Likes

plt.scatter(data = data, x='Likes', y='Share', s=100, marker='*')
plt.title("Likes and Shares of institutes in Social Media")
plt.xlabel("Likes")
plt.ylabel("Shares")
plt.show()

* The data shows that there is weak positive correlation between Likes and Shares

In [None]:
# Horizontal Bargraph: 
# Column Data: Institute Name
data['Institute Name'].value_counts().sort_values(ascending=False)[0:10].plot(kind='barh', color='green', figsize=(6,5), title='Comparison of Institutes')

* naresh i technologies has the highest number of posts or reels.

In [None]:
# Boxplot: 
# Column Data: Followers
data['Followers'].plot(kind='box')

In [None]:
# Kde plot: 
# Column Data:Comments
data['Comments'].plot(kind='kde')

In [None]:
# Scatter plot : 
# Column Data: Shares Vs Likes with Reach

sns.scatterplot(data=data, x='Likes', y='Share', hue='Reach')

In [None]:
sns.pairplot(data)

In [None]:
sns.pairplot(data, hue='Reach') # For the hue we need to always take categorical col

In [None]:
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='viridis')

* There are no strong correlations found between numeric columns.

In [None]:
data.columns

In [None]:
# Catplots: MultiVariate plots
sns.catplot(data=data, y='type', x='Location', hue='Share', orient='h')

**Overall Insights on Data**
* **Uni-Variate Insights**
* The dataset covers 21 institutes; naresh i technologies has the most posts.
* There are 9 different courses in the data; the full stack development course has the most posts or reels.
* The data is collected from Instagram.
* The average number of followers is 23126.
* Most of the content is uploaded as a post.
* The average number of likes is 52.
* Comments range from a minimum of 0 to a maximum of 42.
* The average number of shares is 2.
* Most of the institutes are from hyderabad.
* Low reach is the most frequent reach class.
* More posts or reels are published on tuesday than on any other day.
* **Study of Data between two or more columns**
* No strong correlations were found between numeric columns
    * There is a weak positive correlation between likes vs shares and followers vs likes
* Most posts or reels have low reach.
* There are more posts than reels
* Posts or reels of nxtwave disruptive technologies have the highest number of likes
* The AWS, DevOps course has the highest number of likes

### 3. Missing Values & Outlier Handling<a id=naout>
    
[Back to Top](#menu)

#### 3.1 Missing Values

**Once the dataset is collected, validated and analyzed, missing values (NA) and outliers should be handled before predictive modeling, because most algorithms cannot accept missing values and tend to perform worse when outliers are present.**

* We need to find and handle both of them

In [None]:
vdata = data.copy()

* Missing values are empty cells or data points that do not belong to the column
* Identify Missing Values
    - Check for Standard & Non-Standard nan values 
* Handle the Missing Values
    - Drop (Row, Column)
    - Replace (MCT, Imputation, etc...)

**a) Identification**

                                                     Column Wise

In [None]:
# pandas function for each column missing values count

vdata.isnull().sum()

* All the columns have missing values

* If a feature/column has more than 70% of its data missing, we can consider dropping it
    - If the column is important for the business, the missing values need to be replaced instead
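
A minimal sketch of the 70% rule, assuming a hypothetical helper `columns_to_review` and made-up toy data. It only lists the candidate columns; whether to drop them remains a business decision.

```python
import numpy as np
import pandas as pd

def columns_to_review(df, threshold=0.70):
    """Return the columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()

toy = pd.DataFrame({
    "Likes": [5, np.nan, 7, 9],               # 25% missing
    "Notes": [np.nan, np.nan, np.nan, "x"],   # 75% missing
})
print(columns_to_review(toy))
```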

In [None]:
# Checking Na count percentage for each column

round((vdata.isnull().sum()/len(vdata))*100,2)

                                                            Row

* For the row-wise drop, pick rows that have missing values in at least half of the columns
* These rows can then be dropped
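
The row-wise rule can be sketched without hard-coding the column count, by deriving the threshold from the frame itself. The toy frame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [1, np.nan, np.nan],
    "c": [1, np.nan, 3],
    "d": [1, 4, 3],
})

max_missing = toy.shape[1] // 2          # half of the columns
row_missing = toy.isnull().sum(axis=1)   # missing count per row

# Keep rows with fewer than half of their values missing
kept = toy[row_missing < max_missing].reset_index(drop=True)
print(kept)
```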

In [None]:
vdata[vdata.isnull().sum(axis=1)>=6]

# a threshold of 6 missing values is about half of the columns in a row

* Rows were found with about half of their data missing

In [None]:
# Drop rows where at least 6 values are missing

# Indexes of those rows
allnaindx = vdata[vdata.isnull().sum(axis=1)>=6].index

# Report the actual count instead of a hard-coded number
print("Rows to be deleted:", len(allnaindx))

# Drop those rows and rebuild the index
vdata = vdata.drop(allnaindx, axis=0).reset_index(drop=True)

In [None]:
vdata.shape

* Check of missing values after drop method

In [None]:
vdata.isnull().sum()

In [None]:
vdata.dtypes

#### b.2) Replace 
    Replacing Missing Values can be done Col wise

In [None]:
# Numeric columns: replace missing values with the column median
for col in ['Followers', 'Likes', 'Comments', 'Share']:
    vdata[col] = vdata[col].fillna(vdata[col].median())

In [None]:
vdata.isnull().sum()

###### Categorical column null values are replaced with the least frequent classes

In [None]:
vdata['Day'].value_counts()

In [None]:
vdata[vdata['Day'].isnull()].index

In [None]:
indexes11=vdata[vdata['Day'].isnull()].index

In [None]:
len(indexes11)

In [None]:
# .loc on the frame avoids chained assignment; labels equal positions after reset_index
vdata.loc[indexes11[0:9], 'Day'] = 'friday'
vdata.loc[indexes11[9:18], 'Day'] = 'saturday'
vdata.loc[indexes11[18:], 'Day'] = 'sunday'
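
The same pattern (spreading missing entries over the least frequent classes) is repeated by hand for several columns in this section. A minimal sketch of a reusable helper, assuming the hypothetical name `fill_with_least_frequent`; the chunk sizes are computed rather than hard-coded, so they may differ slightly from the manual slices used in the notebook.

```python
import numpy as np
import pandas as pd

def fill_with_least_frequent(series, n_classes=3):
    """Spread missing entries evenly over the n least frequent classes."""
    filled = series.copy()
    # Classes ordered from rarest to most common
    rare = filled.value_counts().sort_values(kind="stable").index[:n_classes]
    missing_idx = filled.index[filled.isnull()].to_numpy()
    for label, chunk in zip(rare, np.array_split(missing_idx, n_classes)):
        filled.loc[chunk] = label
    return filled

demo = pd.Series(["mon"] * 5 + ["tue"] * 4 + ["fri"] * 1 + ["sat"] * 2 + ["sun"] * 3 + [None] * 6)
filled = fill_with_least_frequent(demo)
print(filled.value_counts())
```

Because this imputation changes the class distribution on purpose, it is worth checking `value_counts()` before and after.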

In [None]:
# first check for mode of that column
vdata['Institute Name'].mode()

In [None]:
vdata['Institute Name'].value_counts()

In [None]:
# 36 null values in Institute Name column(12+12+12=36)


In [None]:
vdata[vdata['Institute Name'].isnull()].index

In [None]:
indexes=vdata[vdata['Institute Name'].isnull()].index

In [None]:
indexes

In [None]:
len(indexes)

In [None]:
# .loc on the frame avoids chained assignment
vdata.loc[indexes[0:12], 'Institute Name'] = 'besant technologies'
vdata.loc[indexes[12:24], 'Institute Name'] = 'saidemy'
vdata.loc[indexes[24:], 'Institute Name'] = 'social prachar'

                                                    CourseName

In [None]:
vdata['CourseName'].value_counts()

In [None]:
vdata[vdata['CourseName'].isnull()].index

In [None]:
indexes1=vdata[vdata['CourseName'].isnull()].index

In [None]:
len(indexes1) # 54=18+18+18

In [None]:
# .loc on the frame avoids chained assignment
vdata.loc[indexes1[0:18], 'CourseName'] = 'gcp'
vdata.loc[indexes1[18:36], 'CourseName'] = 'ds'
vdata.loc[indexes1[36:], 'CourseName'] = 'dataskills'

In [None]:
vdata.isnull().sum()

                            platform

In [None]:
vdata['Platform'].value_counts()

In [None]:
vdata['Platform'] = vdata['Platform'].fillna(vdata['Platform'].mode()[0])

In [None]:
vdata['type'].value_counts()

In [None]:
vdata[vdata['type'].isnull()].index

In [None]:
indexes23=vdata[vdata['type'].isnull()].index

In [None]:
len(indexes23)

In [None]:
vdata.loc[indexes23[0:20], 'type'] = 'reel'

In [None]:
vdata.isnull().sum()

                                        Reach

In [None]:
vdata['Reach'].value_counts()

In [None]:
vdata[vdata['Reach'].isnull()].index

In [None]:
indexes45=vdata[vdata['Reach'].isnull()].index

In [None]:
len(indexes45)

In [None]:
# .loc on the frame avoids chained assignment
vdata.loc[indexes45[0:97], 'Reach'] = 'moderate'
vdata.loc[indexes45[97:194], 'Reach'] = 'high'
vdata.loc[indexes45[194:], 'Reach'] = 'low'

In [None]:
vdata.isnull().sum()

In [None]:
vdata

**3.2 Outliers**

In [None]:
from outlier import outlier_detect, outlier_replacement

In [None]:
outcols = outlier_detect(vdata)

In [None]:
outlier_replacement(vdata, outcols)

In [None]:
# Final Check of Outliers

outlier_detect(vdata)

* Almost all the outliers in columns are replaced
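
The `outlier` module imported above is user-defined and its source is not shown in this notebook. The sketch below shows one common way such helpers are written, using IQR fences and capping; the function names and the choice of the IQR method are assumptions, not the actual contents of that module.

```python
import pandas as pd

def iqr_bounds(col, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def detect_outlier_columns(df):
    """Return numeric columns with at least one value outside the IQR fences."""
    flagged = []
    for name in df.select_dtypes("number").columns:
        low, high = iqr_bounds(df[name])
        if ((df[name] < low) | (df[name] > high)).any():
            flagged.append(name)
    return flagged

def cap_outliers(df, columns):
    """Clip values in `columns` to their IQR fences and return a new frame."""
    capped = df.copy()
    for name in columns:
        low, high = iqr_bounds(capped[name])
        capped[name] = capped[name].clip(lower=low, upper=high)
    return capped

toy = pd.DataFrame({"Likes": [10, 12, 11, 13, 12, 500], "Share": [1, 2, 3, 4, 5, 6]})
cols = detect_outlier_columns(toy)
capped = cap_outliers(toy, cols)
print(cols, capped["Likes"].max())
```

Capping changes the quartiles, so a second detection pass can still flag a few values, which is consistent with the "almost all" wording above.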

### 4. Predictive Modeling<a id=pm>
    
[Back to Top](#menu)
    
* Building a predictive model/trained algorithm to learn the relation between one column (y) and the other columns (X)

#### 4.1 Selecting X & y <a id=xy>
    
[Back to Top](#menu)

* Selecting Output column (y) - future prediction column & Input column/columns (X) - Reference columns

    - X (independent variables/input columns/explanatory variables)
    - y (dependent variable/output column/response column)

In [None]:
vdata.head(2)

* For this dataset we want to predict the Reach class, taking the **Reach column as Output (y)**
    - **Selected remaining columns are taken as input (X)**

In [None]:
# Alternative: X=vdata.drop(columns=['Platform','Likes','Comments','Share','Reach'])
# .copy() gives X its own data, so the encoding steps below do not modify a view of vdata
X=vdata[['Institute Name','CourseName','Followers','type','Location','Day']].copy()

In [None]:
y=vdata['Reach']

In [None]:
X.head(2)

In [None]:
y.head(2)

In [None]:
print("Input Columns Data (X):")
display(X.head())
print()
print("Output Column Data (y):")
display(y.head())

In [None]:
y.value_counts()

* The y column is balanced

#### 4.2  Feature Engineering of X<a id=fe>

[Back to Top](#menu)

* Generation/ Modification / Deletion / Selection of X columns/features according to y column

In [None]:
X.head()

* Need to handle Institute Name, CourseName, type, Location, Day

#### 4.2.1 Feature Generation

* Done in Data validation

#### 4.2.2 Feature Selection/Deletion

**Considering all the selected columns for predictive modeling, as each column has its own importance to Reach**

* If the model performance is not acceptable, we can come back to this step and select the important X features using statistical analysis

In [None]:
# Saving Data for UserInput
X.to_csv("inputdata.csv",index=False)

In [None]:
df = X.copy()

In [None]:
df.head()

#### 4.2.3 Feature Modification (Data Pre-Processing)

**Encoding**

* Converting categorical data to numeric

In [None]:
# Selecting Categorical Data
X.select_dtypes("O").head()

In [None]:
cat_cols = X.select_dtypes("O")

In [None]:
cat_cols

* **Checking number of categories in above columns and applying encoding techniques**

In [None]:
for col in cat_cols:
    print(col, ":", cat_cols[col].nunique())
    print(cat_cols[col].unique())
    print()

* From above, each column has more than two classes except type and Location
* Need to Identify ordinal and nominal columns to apply proper encoding
    - Ordinal Encoding - for Ordinal Cols
    - One-Hot Encoding - for Nominal Cols

* From Business Pov we can consider,

      Binary columns: type,location
      Ordinal columns: Day,course name
      Nominal columns: institute name

                               Binary encoding-- type

In [None]:
X.type.unique()

In [None]:
X['type'] = X['type'].replace({'post':0,'reel':1})

                            Binary encoding-location

In [None]:
X.Location.unique()

In [None]:
X['Location'] = X['Location'].replace({'hyderabad':0,'banglore':1})

* **Ordinal Columns Data Encoding**

In [None]:
X[['Day','CourseName']]

* We can replace above cat classes with ordinal numbers, 
    - Those numbers can be,
        - target encoding: output column value according to class or
        - frequency encoding: number of rows in each class or
        - ordinal number according to alphabetical order
        - or our own choice of numbers

* **For the above ordinal cols the loop below uses frequency encoding (the value counts of each class)**
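
The encoding loop in this section maps each class to its value count, which is usually called frequency encoding. A minimal sketch contrasting it with mean target encoding on toy data; the 0/1/2 ranking of Reach is an assumption made only so that a mean can be computed.

```python
import pandas as pd

toy = pd.DataFrame({
    "Day":   ["monday", "monday", "tuesday", "tuesday", "tuesday", "friday"],
    "Reach": ["low", "high", "low", "low", "moderate", "high"],
})

# Frequency (count) encoding: each class is replaced by how often it occurs
freq_map = toy["Day"].value_counts().to_dict()
toy["Day_freq"] = toy["Day"].map(freq_map)

# Target (mean) encoding needs a numeric target; this ranking is illustrative
reach_rank = {"low": 0, "moderate": 1, "high": 2}
target_map = toy["Reach"].map(reach_rank).groupby(toy["Day"]).mean().to_dict()
toy["Day_target"] = toy["Day"].map(target_map)

print(toy)
```

Target encoding uses the output column, so in practice it would be fitted on the training split only to avoid leaking test labels.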

In [None]:
# Frequency-encode each ordinal column and save the mapping

import pickle # Module to save encoding dictionaries

for col in ['Day','CourseName']:
    # value_counts gives the number of rows per class; that count becomes the encoded value
    # note: classes with equal counts receive the same code
    grouped = X[col].value_counts()
    ordencoding = grouped.to_dict()
    # saving above encoding for future use
    with open(f'{col}_encoding.pkl', 'wb') as f:
        pickle.dump(ordencoding, f)

    X[col] = X[col].replace(ordencoding)

# To load an encoding for future use

# with open('Day_encoding.pkl', 'rb') as f:
#     loaded_dict = pickle.load(f)

In [None]:
X.head()

* **Nominal Column Data Encoding**      - **Institute name**

In [None]:
# Library
from sklearn.preprocessing import OneHotEncoder

# Define Object
ohe = OneHotEncoder(handle_unknown='ignore')

# handle_unknown='ignore' -> categories not seen during fit are encoded as all zeros
# drop='first' -> another param, used to avoid the dummy variable trap

In [None]:
#Using fit_transform method to convert column data into onehot encodings

ohedata = ohe.fit_transform(X[['Institute Name']]).toarray()


In [None]:
ohedata

* Adding above one hot encoded data to X

In [None]:
# Converting to dataframe

ohedata = pd.DataFrame(ohedata, columns=ohe.get_feature_names_out())

In [None]:
ohedata

In [None]:
# Replacing the Institute Name column with its one-hot encoded columns

X = pd.concat([X.drop(['Institute Name'], axis=1), ohedata], axis=1)

In [None]:
X.head()

**Scaling**

* Bringing the main numeric columns onto one scale if necessary

* Scaling is suggested when column data is on different scales
    - Standard Scaler (mean 0, standard deviation 1; most values fall between -3 and +3)
    - Robust Scaler (when we have outliers in columns)
    - MinMaxScaler (0 to 1 by default)

* Followers is on a much larger scale than the encoded columns, so scaling is applied
    - Followers (the only raw numeric column kept in X)
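
A minimal sketch comparing the three scalers on a toy follower column that contains one extreme value. The numbers are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

followers = np.array([[100.0], [200.0], [300.0], [400.0], [10000.0]])

standard = StandardScaler().fit_transform(followers)   # mean 0, standard deviation 1
minmax = MinMaxScaler().fit_transform(followers)       # minimum 0, maximum 1
robust = RobustScaler().fit_transform(followers)       # median 0, divided by the IQR

print(standard.ravel().round(2))
print(minmax.ravel().round(2))
print(robust.ravel().round(2))
```

With the min-max scaler the extreme value squeezes the other points close to 0, which is why the robust scaler is suggested when outliers remain.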

In [None]:
X[['Followers']]

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Followers is selected by name rather than by position
X['Followers'] = sc.fit_transform(X[['Followers']]).ravel()

* **Final X Data for modeling**

In [None]:
X.head()

* **y data**

In [None]:
y.head()

#### 4.3 Train-Test Split of X & y<A id=tt>
    
[Back to Top](#menu)
    
* Dividing data into two parts, train-test
    - train part used for model building
    - test part is used for model evaluation
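
A minimal sketch of the split on toy data. The `stratify` argument is an optional addition that is not used in the notebook's own split; it keeps the class proportions of y the same in the train and test parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X_toy = pd.DataFrame({"Followers": np.arange(30)})
y_toy = pd.Series(["low"] * 10 + ["moderate"] * 10 + ["high"] * 10)

xtr, xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=123, stratify=y_toy
)

print(xtr.shape, xte.shape)
print(yte.value_counts().to_dict())
```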

In [None]:
# sklearn method

from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.30, random_state=123)

In [None]:
xtrain.shape, ytrain.shape  # model training data

In [None]:
xtrain

In [None]:
print(ytrain)

In [None]:
xtest.shape, ytest.shape # model evaluation data

**4.4 Modeling/Algorithm Training on Train Data**

* Sending xtrain & ytrain data to an algorithm, which studies the patterns and returns a predictive model that generates y for future x values

* Since the **y data is categorical**, we can apply **supervised machine learning classification algorithms**
    
* In Classification we have below algorithms
    - Logistic Regression
    - Knearest Neighbors (KNN)
    - Support Vector Machine (SVM)
    - Naive Bayes (NB)
    - Decision Trees (CART)
    - Random Forest (Bagging)
    - Xgboost (Boosting)

* Importing Sklearn Library Models Functions

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

* Model Objects Defining

In [None]:
# Model Define

log = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)


# ---------------------------------------------------------
knn = KNeighborsClassifier(n_neighbors=5, p=1)

# n_neighbors (k) is the main hyperparameter
# p selects the distance metric: p=1 is Manhattan (used here), p=2 is Euclidean

# ---------------------------------------------------------
dt = DecisionTreeClassifier() # Taking default Hyper params

# We can try hyp params:
# criterion is the measure used to choose each split (gini or entropy)
# max_depth is the maximum depth of the tree - main Hyperparameter

# -----------------------------------------------------------
rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=20,      # Maximum depth of trees
    min_samples_split=5, # Minimum samples required to split a node
    min_samples_leaf=2,  # Minimum samples required at a leaf node
    random_state=42
)

# Hyperparameters are set explicitly above (not the defaults)
# n_estimators is the number of decision trees - hyperparameter

# ------------------------------------------------------------
sv = SVC(kernel='rbf', gamma=5)  # for non-linearly separable data

# gamma is the coefficient of the rbf kernel - hyperparameter
# (related to the kernel width sigma by gamma = 1/(2*sigma^2))
# kernel='linear' gives a linear SVM
# sv = SVC(kernel="linear")  # for linearly separable data

# -----------------------------------------------------------
nb = GaussianNB()
    
# -----------------------------------------------------------
xgb = XGBClassifier() # Taking default Hyper params

**Model Training**

* Using xtrain, ytrain data
* Using fit command to train the defined model with xtrain, ytrain

**4.4.1 Logistic Regression**

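* It passes the output of a linear equation, $z = w^\top x + b$, through the sigmoid (logistic) function to get the probability of a class:

$$\text{prob} = \frac{1}{1 + e^{-z}}$$

* For a binary target: predict class 1 if prob > 0.5, else class 0
* With the three Reach classes used here, the multinomial model applies the same idea with one linear score per class and predicts the class with the highest probability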
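
As a rough illustration of the formula above, the following plain-Python sketch computes a sigmoid probability and a three-class softmax. The scores and class list are hypothetical values chosen for the example; this is not how sklearn implements `LogisticRegression` internally.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: threshold the sigmoid output at 0.5
z = 0.8                                     # hypothetical linear score
prob = sigmoid(z)
binary_class = 1 if prob > 0.5 else 0

# Three-class case: pick the class with the highest softmax probability
classes = ["high", "low", "moderate"]
class_scores = [0.2, 1.5, -0.3]             # hypothetical scores, one per class
probs = softmax(class_scores)
predicted = classes[probs.index(max(probs))]

print(round(prob, 3), binary_class, predicted)
```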

* Learning/Training the model on train data
* We can use the model's fit function with the xtrain and ytrain data to learn the line coefficients

In [None]:
log.fit(xtrain, ytrain)

Parameters

In [None]:
log.intercept_

In [None]:
log.coef_

**4.4.2 KNN - K Nearest Neighbors**
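* It takes the k nearest data points to a query point, using the chosen distance metric (Manhattan here, since p=1; Euclidean when p=2), and predicts the majority class among them

* It is a lazy algorithm: fit does not learn any parameters, it only stores the training data

* The distance computation happens at prediction time, when test data is given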
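
The lazy behaviour described above can be sketched in a few lines of plain Python. The training points and labels below are made up for illustration, and `knn_predict` is a hypothetical helper, not the sklearn implementation.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan (p=1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3):
    """Predict by majority vote among the k stored points closest to query."""
    # "Training" is only keeping train_x and train_y; all the work happens here
    order = sorted(range(len(train_x)), key=lambda i: manhattan(train_x[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical stored training data
train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_x, train_y, (2, 2)))
print(knn_predict(train_x, train_y, (7, 9)))
```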

In [None]:
knn.fit(xtrain, ytrain)

In [None]:
knn.get_params()

**4.4.3 Decision Tree**
* Predictions come from a tree of if/else rules: a root node, interior nodes, and branches leading to leaf nodes

In [None]:
dt.fit(xtrain, ytrain)

* **Feature Importance**

In [None]:
pd.DataFrame(index = dt.feature_names_in_,data = [round(val,2) for val in dt.feature_importances_], columns = ['FeatureImportance'])

* **Tree**

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize = (30,30), dpi = 150)
plot_tree(dt,filled = True, feature_names=list(xtrain.columns))
plt.show()

**4.4.4 Random Forest**
* Bagging algorithm that combines multiple Decision Trees


In [None]:
rf.fit(xtrain, ytrain)

In [None]:
pd.DataFrame(index = rf.feature_names_in_,data = [round(val,2) for val in rf.feature_importances_], columns = ['FeatureImportance'])

* Trees

In [None]:
rf.estimators_

In [None]:
plt.figure(figsize = (15,10),dpi = 150)
plot_tree(rf.estimators_[1], filled=True, feature_names=list(xtrain.columns))
plt.show()

**4.4.5 SVM (Time Taking for Higher Dimensional Data)**
* Support vectors are the data points closest to the decision boundary; for linearly separable data a soft-margin classifier is fitted around them

* For non-linear data, the kernel trick (rbf, poly) is used to separate the classes

In [None]:
sv.fit(xtrain, ytrain)

In [None]:
sv.get_params()

**4.4.6 Naive Bayes**
* Naive Bayes applies Bayes' theorem, assuming the features are conditionally independent given the class

In [None]:
nb.fit(xtrain, ytrain)

**4.4.7 Xgboost**
* Boosting algorithm: models are trained in sequence, and each new model is trained on the errors of the previous ones

  xgboost needs to be installed first, e.g. from the Anaconda prompt: pip install xgboost

In [None]:
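from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# LabelEncoder numbers the classes in alphabetical order (high=0, low=1, moderate=2)
ytrain_xg = le.fit_transform(ytrain)

In [None]:
# xgboost accepts label data as numbers
# This explicit ordinal mapping (low=0, moderate=1, high=2) overwrites the LabelEncoder result
# and is the encoding used in the evaluation cells

ytrain_xg = np.where(ytrain == 'low', 0, np.where(ytrain == 'moderate', 1, 2))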

In [None]:
xgb.fit(xtrain,ytrain_xg)
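
The label encoding matters because predictions come back as integers and must be compared against labels encoded the same way. As a small plain-Python sketch (the label list is hypothetical), alphabetical numbering, which is how a sorted-classes encoder such as `LabelEncoder` assigns codes, differs from an explicit ordinal mapping:

```python
labels = ["low", "high", "moderate", "low", "high"]  # hypothetical Reach labels

# Alphabetical numbering: classes are sorted, then numbered from 0
alphabetical = {name: code for code, name in enumerate(sorted(set(labels)))}

# Explicit ordinal mapping that follows the natural order of the classes
ordinal = {"low": 0, "moderate": 1, "high": 2}

alphabetical_codes = [alphabetical[v] for v in labels]
ordinal_codes = [ordinal[v] for v in labels]

print(alphabetical)
print(alphabetical_codes)
print(ordinal_codes)
```

Whichever mapping is used for training has to be reused when encoding the test labels.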

**4.5 Test Predictions & 4.6 Model Evaluation/Performance**

* Checking Trained Model Performances on Test Data

* Predictions are generated from the xtest data and then compared with ytest

* To check Model Performance we can use evaluation methods

    * Error/Loss
    * Model Score 
    * Bias-Variance Trade off (Underfit or Overfit)
    * Cross-Val Score

* **For classification we can use these evaluation methods**
    


Performance Metric | Classification
-------|-----------
**Loss or Error**|**Confusion Matrix (Number of right/wrong predictions)**
**Model Score (Evaluation)** | **Accuracy Score (Balanced Data) , F1-Score/Auc-Roc Score (For Imbalanced Data)**
**Bias-Variance Trade Off**|High error & low score on both train and test data (Underfit)
-|Low train error & high test error (Overfit)
**Cross-Val Score**|Average score when the model is retrained and scored on different folds of the entire X and y data
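
To make the first two rows of the table concrete, here is a minimal plain-Python sketch of a confusion matrix and an accuracy score. The true and predicted labels are invented for the example, and the helper names are hypothetical; the notebook itself relies on the sklearn metric functions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in classes] for t in classes]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

classes = ["low", "moderate", "high"]
y_true = ["low", "low", "moderate", "high", "high", "moderate"]  # hypothetical
y_pred = ["low", "moderate", "moderate", "high", "low", "moderate"]

matrix = confusion_counts(y_true, y_pred, classes)
acc = accuracy(y_true, y_pred)

for name, row in zip(classes, matrix):
    print(name, row)
print(round(acc, 2))
```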

In [None]:
y.value_counts()

* **As the classes are nearly balanced, accuracy score is used to understand model performance**

In [None]:
# Modules for Metrics

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay, roc_curve, roc_auc_score, auc
from sklearn.model_selection import cross_val_score
from tabulate import tabulate
import numpy as np

* **Checking the performance of the above models using test data**

In [None]:
names = ['LogisticRegression', 'KNearestNeighbors', 'SVM', 'Naive Bayes', 'Decision Tree', 'Random Forest', 'Xgboost']

models = {'log':log,'knn':knn,'svm':sv, 'nb':nb, 'dt':dt, 'rf':rf, 'xgb':xgb}

* **Confusion_matrix , Classification_report**

In [None]:
from simple_colors import green, blue

for name, (key, model) in zip(names, models.items()):
    print(green("Model: {}\n".format(name), ['bold']))
    if key == 'xgb':
        # xgboost was trained on numeric labels, so ytest is encoded the same way
        ytest_pred = model.predict(xtest)
        ytest_xg = np.where(ytest == 'low', 0, np.where(ytest == 'moderate', 1, 2))
        print("Classification Report:\n", classification_report(ytest_xg, ytest_pred))
        print(blue("Confusion_Matrix:", ['bold']))
        ConfusionMatrixDisplay.from_estimator(model, xtest, ytest_xg)
    else:
        ytest_pred = model.predict(xtest.values)
        print("Classification Report:\n", classification_report(ytest, ytest_pred))
        print(blue("Confusion_Matrix:", ['bold']))
        ConfusionMatrixDisplay.from_estimator(model, xtest.values, ytest)
    # Draw the confusion matrix created above, then separate the models visually
    plt.show()
    print("-----------------------------------------------------------------------------------")

**Making Table for accuracy scores of all the models for train and test data to decide best model**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from tabulate import tabulate
import numpy as np


def encode_reach(labels):
    # xgboost needs numeric labels: low=0, moderate=1, high=2
    return np.where(labels == 'low', 0, np.where(labels == 'moderate', 1, 2))


def model_fit(trscore, tescore):
    # Bias-Variance Trade off, judged from train and test accuracy
    if trscore < 0.50 and tescore < 0.50:
        return "Nofit" if trscore == 0 and tescore == 0 else "Underfit"
    return "Goodfit" if abs(trscore - tescore) < 0.10 else "Overfit"


results = []

for name, model in models.items():
    if name == 'xgb':
        # XGBoost was trained on the DataFrame with numeric labels
        xtr, xte, xall = xtrain, xtest, X
        ytr, yte, yall = encode_reach(ytrain), encode_reach(ytest), encode_reach(y)
    else:
        # Other models are scored on plain arrays with the original string labels
        xtr, xte, xall = xtrain.values, xtest.values, X.values
        ytr, yte, yall = ytrain, ytest, y

    # Accuracy Score
    trscore = round(accuracy_score(ytr, model.predict(xtr)), 2)
    tescore = round(accuracy_score(yte, model.predict(xte)), 2)

    # Cross-val score (the model is refit on each fold of the full data)
    scores = cross_val_score(model, xall, yall, cv=2, scoring='f1_micro')
    crossvalscore = round(scores.mean(), 2)

    # Append results
    results.append([name, f"{trscore:.2f}", f"{tescore:.2f}", f"{crossvalscore:.2f}", model_fit(trscore, tescore)])

# Print the results in a tabular format
print(tabulate(results, headers=["Model", "Train Accuracy", "Test Accuracy", "Cross-Validation Score", "Model Fit"], tablefmt="grid"))
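
The cross-validation column is computed by splitting the full data into folds, training on all but one fold and scoring on the held-out fold. The sketch below shows only the index splitting, in plain Python with a hypothetical `kfold_indices` helper. It uses contiguous, unshuffled folds; for classifiers with an integer `cv`, sklearn uses a stratified split instead, so this is a simplification.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=5, k=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```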

**4.7 Best Model**<a id='best'>
    
[Back to Top](#menu)

* Observing the above results, based on test score:

    * **Logistic regression and random forest perform better than the other models**

**4.8 Hyp Param Tuning**<a id='hyp'>
    
[Back to Top](#menu)

* Hyperparameter tuning for random forest

**4.9 Saving Model**<a id='dep'>
    
[Back to Top](#menu)
                                                    
- Saving the trained & evaluated model for future predictions
    - From the above training, we need to save the model and the preprocessing objects
    - Python has the joblib and pickle libraries to save model files
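
As a minimal sketch of the save-and-load round trip, the example below pickles a small encoding dictionary to a temporary file and reads it back. The dictionary contents and file name are hypothetical; the same pattern applies to the encoding dictionaries loaded later in the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical ordinal encoding, standing in for a saved preprocessing object
day_encoding = {"monday": 0, "tuesday": 1, "wednesday": 2}

path = os.path.join(tempfile.mkdtemp(), "day_encoding_demo.pkl")

# Save the object to disk
with open(path, "wb") as f:
    pickle.dump(day_encoding, f)

# Load it back, as a later session would
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == day_encoding)
```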

In [None]:
import joblib

# Saving trained model
joblib.dump(log, 'logistic.pkl')

# Saving the fitted one-hot encoder
joblib.dump(ohe, 'ohe.pkl')

# Saving the fitted scaler
joblib.dump(sc, "sc.pkl")

#### 4.10 Realtime Prediction<a id='pred'>

[Back to Top](#menu)

In [None]:
# Loading saved pickles and getting predictions

import pickle
import joblib

log = joblib.load('logistic.pkl')
feature_names = log.feature_names_in_.tolist()

# Loading the saved scaler
sc = joblib.load("sc.pkl")

# Loading the saved one-hot encoder
ohe = joblib.load('ohe.pkl')

# Loading the saved ordinal-encoding dictionaries
with open('Day_encoding.pkl', 'rb') as f:
    Day_encoding = pickle.load(f)

with open('CourseName_encoding.pkl', 'rb') as f:
    CourseName_encoding = pickle.load(f)

In [None]:
import pandas as pd
import numpy as np

def ReachinSocialMedia_Prediction():
    print("Reference Data for Input:")
    
    # Load input data
    inpdata = pd.read_csv("inputdata.csv")
    display(inpdata.head())

    print("\nLogistic regression built on the below X columns:")
    print(inpdata.columns)

    print("\n======================= Enter User Input Data ====================")

    # User Inputs
    print("\nEnter Institute Name:")
    print(inpdata['Institute Name'].unique())
    insti = input("Select institute: ").strip()

    print("\nEnter Course Name:")
    print(inpdata['CourseName'].unique())
    course = input("Select course: ").strip()

    print("\nEnter Followers:")
    # float() parses the number without executing arbitrary input the way eval() would
    followers = float(input(f"min-{inpdata['Followers'].min()}, max-{inpdata['Followers'].max()}: "))

    print("\nEnter type of post:")
    print(inpdata['type'].unique())
    typeofpost = input("Select type: ").strip()

    print("\nEnter Location:")
    print(inpdata['Location'].unique())
    location = input("Select location: ").strip()

    print("\nEnter Day:")
    print(inpdata['Day'].unique())
    day = input("Select day: ").strip()

    # Create user input DataFrame
    row = pd.DataFrame([[insti, course, followers, typeofpost, location, day]], columns=inpdata.columns)

    print("\nGiven User Input Data:")
    display(row)

    ####### Data Pre-Processing #######

    # Binary Encoding (assign the result back instead of using inplace on a column)
    row['type'] = row['type'].replace({'post': 0, 'reel': 1})
    row['Location'] = row['Location'].replace({'hyderabad': 0, 'banglore': 1})

    # Ordinal Encoding (stop early if an encoding dictionary is missing)
    try:
        row['Day'] = row['Day'].replace(Day_encoding)
    except NameError:
        print("\nError: Day_encoding dictionary is missing.")
        return

    try:
        row['CourseName'] = row['CourseName'].replace(CourseName_encoding)
    except NameError:
        print("\nError: CourseName_encoding dictionary is missing.")
        return

    # One-Hot Encoding of Institute Name
    try:
        ohe = joblib.load('ohe.pkl')

        if isinstance(ohe, list):
            ohe = ohe[0]

        # Get feature names
        ohe_feature_names = ohe.get_feature_names_out().tolist()

        # Transform the input data (convert to a dense array if the encoder returns a sparse matrix)
        row_ohe = ohe.transform(row[['Institute Name']])
        if hasattr(row_ohe, 'toarray'):
            row_ohe = row_ohe.toarray()
        row_ohe = pd.DataFrame(row_ohe, columns=ohe_feature_names)

    except ValueError:
        print("\nWarning: Given Institute Name is not in training data. Proceeding with all zeros.")
        row_ohe = pd.DataFrame(np.zeros((1, len(ohe.get_feature_names_out()))), columns=ohe.get_feature_names_out())

    # Merge the one-hot columns into the row in both cases
    row = pd.concat([row.drop(['Institute Name'], axis=1), row_ohe], axis=1)

    # Normalize Followers
    row[['Followers']] = sc.transform(row[['Followers']])

    # Ensure feature alignment with the trained model (missing columns are filled with 0)
    expected_features = log.feature_names_in_
    row = row.reindex(columns=expected_features, fill_value=0)


    print("\nProcessed features for prediction")
    print("********** Logistic Prediction ***********")

    # Prediction
    try:
        probs = log.predict_proba(row)[0]
        print(probs)
        for category, prob in zip(log.classes_, probs):
            print(f"{category}: {round(prob, 2)}")

        # Get highest probability category
        global predicted_category
        predicted_category = log.classes_[probs.argmax()]
        print("\nPredicted Category:", predicted_category)

    except Exception as e:
        print("\nError during prediction:", e)



In [None]:
ReachinSocialMedia_Prediction()

### Regression

In [None]:
df

In [None]:
Xr = pd.concat([df, vdata['Reach']], axis=1)

In [None]:
Xr.head()

In [None]:
Xr.to_csv("inputdata1.csv",index=False)

In [None]:
X = pd.concat([X, vdata['Reach']], axis=1)

Encoding

In [None]:
X['Reach'] = X['Reach'].replace({'low': 0, 'moderate': 1, 'high': 2})

In [None]:
X.head()

In [None]:
Y=vdata[["Likes","Share","Comments"]]

In [None]:
X.head()

In [None]:
Y.head()

**4.3 Train_Test Split (Data Validation)**

In [None]:
# Module
from sklearn.model_selection import train_test_split

In [None]:
# Split
xtrainr, xtestr, ytrainr, ytestr = train_test_split(X, Y, test_size=0.20, random_state=123)

In [None]:
# Checking Shapes
xtrainr.shape, xtestr.shape, ytrainr.shape, ytestr.shape

* Data Used for Model Training

In [None]:
display(xtrainr.head())
display(ytrainr.head())

* Data Used for Model Testing/Evaluation

In [None]:
display(xtestr.head())
display(ytestr.head())

**Model Selection & Python Libraries**

* The output columns are numeric, so regression algorithms are considered

    - **Regression Models/Algorithms:**
        * Linear Algorithms (when the inputs have a linear relationship (correlation) with the output)
            - Linear Regression
            - Polynomial Regression
            - Lasso & Ridge Regression

        * Non-Linear Algorithms (when the inputs do not have a linear relationship with the output; these are the regression versions of the classification algorithms used earlier)
            - Decision Tree Regressor
            - RandomForest Regressor
            - Xgboost Regressor
            - Support Vector Regressor
            - K Nearest Neighbors Regressor

Note:

* The data shows both linear and non-linear relationships, so both groups of algorithms are tried
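
One simple way to judge whether an input is linearly related to an output is the Pearson correlation coefficient. The plain-Python sketch below uses made-up follower and like counts: the first output grows linearly with the input, the second follows a curve, so its correlation is near zero even though it depends entirely on the input.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

followers = [100, 200, 300, 400, 500]        # hypothetical input
likes_linear = [12, 22, 32, 42, 52]          # grows linearly with followers
likes_curved = [400, 100, 0, 100, 400]       # U-shaped, non-linear relationship

r_linear = pearson(followers, likes_linear)
r_curved = pearson(followers, likes_curved)

print(round(r_linear, 3), round(r_curved, 3))
```

A low correlation therefore does not mean the feature is useless; it suggests that a non-linear model may be a better fit for it.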

In [None]:
# Algorithm Modules

from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor  

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, Ridge

from sklearn.neighbors import KNeighborsRegressor

from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor 


**4.4 Modeling - Defining & Training**

In [None]:
# Multiple Linear Regression 

mlr =LinearRegression()

# Polynomial Regression

polyfeat = PolynomialFeatures(degree = 2)  # degree is hyperparam

poly = LinearRegression()

# Lasso (L1) & Ridge (L2)

lasso = MultiOutputRegressor(Lasso(alpha=0.01))# alpha/lambda - hyperparam - penalty

ridge = MultiOutputRegressor(Ridge(alpha=1))



# KNN

knnr = MultiOutputRegressor(KNeighborsRegressor(n_neighbors=5)) # n_neighbors - hyper param

# Support Vector Regressor (SVR predicts a single output, so it is wrapped for the three targets)

svr = MultiOutputRegressor(SVR(kernel='rbf'))

# Decision Tree Regressor

dtr = DecisionTreeRegressor()

# Random Forest regressor 

rfr = RandomForestRegressor(n_estimators=50,min_samples_split=2,min_samples_leaf=2,max_features='sqrt', max_depth= 20) # n_estimators - hyperparam - number of decision trees


# Xgb

xgbr = XGBRegressor(subsample= 0.6, n_estimators=500, max_depth=5, learning_rate=0.01, gamma=0,colsample_bytree= 0.6)


**Training the models defined above one by one (with xtrainr, ytrainr)**

* We can use the fit method on the above objects to train each model

**4.4.1 Linear regression**

In [None]:
# Model Training

mlr.fit(xtrainr, ytrainr)

* Model Params


In [None]:
mlr.coef_, mlr.intercept_

In [None]:
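# Y has three target columns, so mlr.coef_ has one row of coefficients per target
eq = {}

for target, intercept, coefs in zip(ytrainr.columns, mlr.intercept_, mlr.coef_):
    terms = " + ".join('{}*{}'.format(col, c) for col, c in zip(xtrainr.columns, coefs))
    eq[target] = "{} + {}".format(intercept, terms)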

In [None]:
eq

**4.4.2 Polynomial Regression**

**The dimensionality increases sharply and the run time becomes heavy if we take all the inputs**
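
To see how quickly the number of columns grows, the sketch below lists the degree-2 terms for two hypothetical inputs `a` and `b` and checks the count against the closed form C(n + d, d) for n inputs and degree d (bias term included). It is a plain-Python illustration of the expansion, not a replacement for `PolynomialFeatures`.

```python
from itertools import combinations_with_replacement
from math import comb

def poly_terms(names, degree):
    """List every monomial of the inputs up to the given degree, including the bias term."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo) if combo else "1")
    return terms

terms = poly_terms(["a", "b"], 2)
print(terms)

# Number of output columns for n inputs at degree 2
for n in (2, 10, 20):
    print(n, "inputs ->", comb(n + 2, 2), "polynomial features")
```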

In [None]:
xtrainr.head(2)

In [None]:
# Converting x data to poly features

xtrainr_poly = polyfeat.fit_transform(xtrainr) # fit_transform on train

xtestr_poly = polyfeat.transform(xtestr) # transform on test

In [None]:
xtrainr_poly.shape, xtestr_poly.shape

In [None]:
# Applying Linear Regression to above polynomial features

poly.fit(xtrainr_poly, ytrainr)

In [None]:
# params

poly.coef_, poly.intercept_

**4.4.3 Lasso and Ridge**

In [None]:
# Model Training

lasso.fit(xtrainr,ytrainr)
ridge.fit(xtrainr,ytrainr)

In [None]:
lasso.estimators_

In [None]:
ridge.estimators_
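
For reference, the `alpha` hyperparameter set above is the weight of the penalty term. As documented by scikit-learn, for one target column $y$ with $n$ training rows, inputs $X$ and coefficients $w$, the two models minimise:

$$\text{Lasso (L1):}\quad \min_{w}\ \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1$$

$$\text{Ridge (L2):}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2$$

A larger `alpha` shrinks the coefficients more; the L1 penalty can push some coefficients to exactly zero, while the L2 penalty only makes them smaller.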

**4.4.4 KNN Regressor**

In [None]:
knnr.fit(xtrainr, ytrainr)

In [None]:
knnr.get_params()

**4.4.5 Support Vector Regressor**

In [None]:
svr.fit(xtrainr, ytrainr)

In [None]:
svr.get_params()

**4.4.6 Decision Tree Regressor**

In [None]:
dtr.fit(xtrainr, ytrainr)

In [None]:
# Model Params

print("Model Params:")
print(dtr.get_params())
print()
print("Columns Importance:")
print()
for i, j in zip(dtr.feature_names_in_, dtr.feature_importances_):
    print(i+": "+str(round(j,2)))

In [None]:
# Tree Visualization

from sklearn.tree import plot_tree

plt.figure(figsize=(18,18))
plot_tree(dtr,filled=True,fontsize=8,feature_names=list(xtrainr.columns),max_depth=5)
plt.show()

**4.4.7 Random Forest Regressor**

In [None]:
rfr.fit(xtrainr, ytrainr)

In [None]:
# Model Params

print("Model Params:")
print(rfr.get_params())
print()
print("Columns Importance:")
print()
for i, j in zip(rfr.feature_names_in_, rfr.feature_importances_):
    print(i+": "+str(round(j,2)))

In [None]:
# One Tree Visualization

from sklearn.tree import plot_tree

plt.figure(figsize=(18,18))
plot_tree(rfr.estimators_[0],filled=True,fontsize=8,feature_names=list(xtrainr.columns),max_depth=5)
plt.show()

**4.4.8 Xgb Regressor**

In [None]:
# Model Training

xgbr.fit(xtrainr, ytrainr)

In [None]:
xgbr.get_params()

**4.5 & 4.6 Predictions & Evaluations**

* Checking trained model performance with test data
* Using xtest data we will be getting predictions, these predictions will be compared to ytest
    - To compare we can use below metrics,
        - Loss Metric: **RMSE**
        - Score/Performance Metric: **R2Score**
* **For regression models, we have the following techniques to check model performance**

Technique | Outcome
-------|-----------
**Bias-Variance Trade Off**|Model Fitness Based on Train & Test Metrics
**Cross-Validation**|Average score when the model is retrained and scored on different folds of the entire X and y data
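
The two metrics named above can be written out in a few lines of plain Python. The true and predicted values are invented for the example, and the helper names are hypothetical; the notebook imports the equivalent sklearn functions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in the units of y."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """R2 score: share of the variance in y explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10, 20, 30, 40]   # hypothetical actual likes
y_pred = [12, 18, 33, 39]   # hypothetical predicted likes

print(round(rmse(y_true, y_pred), 3))
print(round(r2(y_true, y_pred), 3))
```

RMSE should be read against the scale of the output column, which is why the output distribution is checked before choosing the best model.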

In [None]:
# Libraries

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

* Looping through each trained model for test predictions & evaluation

In [None]:
names = ['Multiple Linear Regression','Polynomial Regression','Lasso Regression','Ridge Regression',
         'Knn Regressor', 'Svm Regressor', 'Decision Tree Regressor', 'RandomForest Regressor', 'Xgboost Regressor']
models = {'mlr':mlr, 'poly':poly, 'lasso':lasso, 'ridge':ridge,  'knn':knnr, 'svm':svr, 'dt':dtr, 'rf':rfr, 'xgb':xgbr}

In [None]:
# Taking User-Defined Module

from mlevalr import regval

In [None]:
trainrmse, testrmse, trainr2, testr2, crossvalscore, fit = regval(xtrainr, xtrainr_poly, xtestr, xtestr_poly, ytrainr, ytestr, models)

In [None]:
# Complete Model Evaluation Table
# Display Options for Table
pd.set_option('display.float_format', lambda x: '%.3f' % x)

display(pd.DataFrame({'Model':names, 'TrainRMSE':trainrmse, 'TestRMSE':testrmse,
             'Trainscore':trainr2, 'Testscore':testr2, 'Crossvalscore':crossvalscore, 'Fit':fit}))

**4.7 Best Model**

In [None]:
# Taking output col distribution

Y.describe()

* Comparing the table above with the distribution of the Y columns, the test RMSE values of the Xgboost regressor are low relative to the scale of the outputs
* According to the above table, the Xgboost Regressor is the better-performing model for this data (low test RMSE, high R2 score, cross-validation score close to the test score)

**4.8 Hyp Param Tuning**

**4.9 Saving Model**

* Saving Trained & Evaluated model for future predictions
    * From above training we need to save xgbr & ohe, sc objects
    * Python has the joblib and pickle libraries to save model files

In [None]:
import pickle
import joblib

# Saving trained model
joblib.dump(xgbr, 'xgbr.pkl')

# Saving the fitted one-hot encoder
joblib.dump(ohe, 'ohe.pkl')

# Saving the fitted scaler
joblib.dump(sc, "sc.pkl")

# Saving the training column names so inputs can be aligned at prediction time
trained_feature_names = X.columns.tolist()

with open("feature_names.pkl", "wb") as f:
    pickle.dump(trained_feature_names, f)


In [None]:
# Loading saved pickles and getting predictions

import pickle
import joblib

xgbr = joblib.load('xgbr.pkl')
feature_names = xgbr.feature_names_in_.tolist()

# Loading the saved scaler
sc = joblib.load("sc.pkl")

# Loading the saved one-hot encoder
ohe = joblib.load('ohe.pkl')

# Loading the saved ordinal-encoding dictionaries and training column names
with open('Day_encoding.pkl', 'rb') as f:
    Day_encoding = pickle.load(f)

with open('CourseName_encoding.pkl', 'rb') as f:
    CourseName_encoding = pickle.load(f)

with open("feature_names.pkl", 'rb') as f:
    trained_feature_names = pickle.load(f)


In [None]:
import pandas as pd
import numpy as np

def LCSofSocialMedia_Post():
    print("Reference Data for Input:")
    
    # Load input data
    inpdata = pd.read_csv("inputdata1.csv")
    display(inpdata.head())

    print("\n Xgboost  regressor built on the below X columns:")
    print(inpdata.columns)

    print("\n======================= Enter User Input Data ====================")

    # User Inputs
    print("\nEnter Institute Name:")
    print(inpdata['Institute Name'].unique())
    insti = input("Select institute: ").strip()

    print("\nEnter Course Name:")
    print(inpdata['CourseName'].unique())
    course = input("Select course: ").strip()

    print("\nEnter Followers:")
    # float() parses the number without executing arbitrary input the way eval() would
    followers = float(input(f"min-{inpdata['Followers'].min()}, max-{inpdata['Followers'].max()}: "))

    print("\nEnter type of post:")
    print(inpdata['type'].unique())
    typeofpost = input("Select type: ").strip()

    print("\nEnter Location:")
    print(inpdata['Location'].unique())
    location = input("Select location: ").strip()

    print("\nEnter Day:")
    print(inpdata['Day'].unique())
    day = input("Select day: ").strip()

    print("\nEnter Reach:")
    print(inpdata['Reach'].unique())
    reach = input("Select Reach: ").strip()

    # Create user input DataFrame
    row = pd.DataFrame([[insti, course, followers, typeofpost, location, day,reach]], columns=inpdata.columns)

    print("\nGiven User Input Data:")
    display(row)

    ####### Data Pre-Processing #######

    # Binary Encoding (assign the result back instead of using inplace on a column)
    row['type'] = row['type'].replace({'post': 0, 'reel': 1})
    row['Location'] = row['Location'].replace({'hyderabad': 0, 'banglore': 1})
    row['Reach'] = row['Reach'].replace({'low': 0, 'moderate': 1, 'high': 2})

    # Ordinal Encoding
    row['Day'] = row['Day'].replace(Day_encoding)
    row['CourseName'] = row['CourseName'].replace(CourseName_encoding)

    # One-Hot Encoding
    if 'Institute Name' not in row.columns:
        raise KeyError(f"Column 'Institute Name' not found in input row. Available columns: {list(row.columns)}")

    ohedata = ohe.transform(row[['Institute Name']])
    if hasattr(ohedata, 'toarray'):
        ohedata = ohedata.toarray()  # convert a sparse result to a dense array
    ohedata = pd.DataFrame(ohedata, columns=ohe.get_feature_names_out())

    # Merge the one-hot columns back into the input row
    row = pd.concat([row.drop(['Institute Name'], axis=1), ohedata], axis=1)

    # Scaling
    row[['Followers']] = sc.transform(row[['Followers']])

    # Ensure the column set and order match training; missing columns are filled with 0
    row = row.reindex(columns=feature_names, fill_value=0)

    print("\nProcessed features for prediction")
    print("********** Xgboost Regressor ***********")
    
    # Prediction
    try:
        # Predict Likes, Shares, and Comments (same order as the Y columns used in training)
        predicted_values = xgbr.predict(row)[0]

        print("Likes:", round(predicted_values[0], 2), "Shares:", round(predicted_values[1], 2), "Comments:", round(predicted_values[2], 2))

    except Exception as e:
        print("\nError during prediction:", e)




In [None]:
LCSofSocialMedia_Post()