
# Import required libraries

In [1]:
from pyspark.sql import SparkSession

In [2]:
from pyspark.sql.functions import col, count, sum, avg, desc, isnull, when

In [3]:
from pyspark.ml.feature import VectorAssembler, StringIndexer

In [4]:
from pyspark.ml.regression import LinearRegression

In [5]:
from pyspark.ml.classification import RandomForestClassifier

In [6]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator

In [7]:
import matplotlib.pyplot as plt


# Initialize Spark session

In [8]:
spark = SparkSession.builder.appName("PredictiveAnalysis").getOrCreate()

Load dataset into Spark DataFrame

In [9]:
CD = spark.read.csv("/content/most visited cities in India.csv", header=True, inferSchema=True)

Display basic dataset information

In [10]:
CD.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Rating: double (nullable = true)
 |-- About the city (long Description): string (nullable = true)
 |-- Best Time to visit: string (nullable = true)



In [11]:
print("Total rows in Dataset:", CD.count())

Total rows in Dataset: 71


Handle missing values - Drop rows with missing values

In [12]:
CD = CD.dropna()

In [13]:
print("Total rows after dropping missing values:", CD.count())

Total rows after dropping missing values: 71


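Feature Selection - Choose the columns used for prediction.
This example assumes "Id" is the feature and "Rating" is the target variable. "Id" appears to be only a row identifier, so treat the result as a walkthrough of the pipeline rather than a meaningful predictor.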

In [15]:
CD = CD.select("Id", "Rating")

Convert the target column to numeric if necessary ("Rating" is already a double in the schema above, so this cast changes nothing here)

In [16]:
CD = CD.withColumn("Rating", col("Rating").cast("double"))

Feature Engineering - Vector Assembler

In [17]:
feature_assembler = VectorAssembler(inputCols=["Id"], outputCol="features")

In [18]:
CD = feature_assembler.transform(CD)

Split data into training and testing sets (80-20 split)

In [19]:
train_data, test_data = CD.randomSplit([0.8, 0.2], seed=42)
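
`randomSplit` assigns rows to the splits at random, so the 0.8 and 0.2 weights are expected proportions rather than exact sizes. The plain-Python sketch below only illustrates that idea. It does not reproduce Spark's sampler or the rows Spark selects for `seed=42`. `bernoulli_split` is a hypothetical helper, and the ids 1 to 71 are stand-ins for the 71 rows counted earlier.

```python
import random

def bernoulli_split(rows, train_weight=0.8, seed=42):
    """Send each row to train or test independently (illustration only)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight
        (train if rng.random() < train_weight else test).append(row)
    return train, test

ids = list(range(1, 72))  # assumed stand-in for the 71 rows
train_ids, test_ids = bernoulli_split(ids)

# The two sizes always add up to 71 but are only approximately 80/20
print(len(train_ids), len(test_ids))
```

On the real DataFrame, `train_data.count()` and `test_data.count()` report the actual split sizes.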

Train a Regression Model

In [20]:
LR = LinearRegression(featuresCol="features", labelCol="Rating")

In [21]:
model = LR.fit(train_data)

Make Predictions

In [22]:
predictions = model.transform(test_data)

In [23]:
predictions.show(10)

+---+------+--------+------------------+
| Id|Rating|features|        prediction|
+---+------+--------+------------------+
|  3|   4.4|   [3.0]| 4.601987358601456|
|  7|   4.9|   [7.0]| 4.583464660973526|
|  9|   4.2|   [9.0]| 4.574203312159561|
| 14|   4.5|  [14.0]| 4.551049940124649|
| 20|   4.8|  [20.0]| 4.523265893682754|
| 24|   4.6|  [24.0]| 4.504743196054824|
| 30|   4.3|  [30.0]| 4.476959149612929|
| 36|   4.5|  [36.0]| 4.449175103171035|
| 46|   4.7|  [46.0]|  4.40286835910121|
| 47|   4.6|  [47.0]|4.3982376846942275|
+---+------+--------+------------------+
only showing top 10 rows
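
With a single feature, a linear regression prediction has the form `intercept + slope * Id`. As a rough check, the sketch below recovers that line from two rows of the printed table and confirms that it reproduces other rows. The slope and intercept here are inferred from the displayed output, not read from the model; on the fitted model itself, `model.coefficients` and `model.intercept` hold the actual values.

```python
# (Id, prediction) pairs copied from the predictions table above
shown = [
    (3, 4.601987358601456),
    (7, 4.583464660973526),
    (9, 4.574203312159561),
    (47, 4.3982376846942275),
]

# Two points are enough to determine a straight line
(x0, y0), (x1, y1) = shown[0], shown[1]
slope = (y1 - y0) / (x1 - x0)
intercept = y0 - slope * x0

def predict(city_id):
    """Prediction from the inferred line (not the saved Spark model)."""
    return intercept + slope * city_id

print("slope:", slope)
print("intercept:", intercept)

# The remaining rows should fall on the same line
for city_id, expected in shown[2:]:
    print(city_id, predict(city_id), expected)
```

The slope comes out slightly negative, meaning the predicted rating falls a little as "Id" increases.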



Evaluate the Model

In [26]:
reg_evaluator = RegressionEvaluator(labelCol="Rating", predictionCol="prediction", metricName="rmse")

In [27]:
rmse = reg_evaluator.evaluate(predictions)

In [28]:
print(f"Root Mean Squared Error (RMSE): {rmse}")

Root Mean Squared Error (RMSE): 0.21073226506393167
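
RMSE is the square root of the mean squared difference between the actual and predicted ratings. The sketch below computes it by hand for the ten rows printed by `predictions.show(10)`. The test set contains more rows than those ten, so the result will not match the evaluator's value exactly. For comparison it also scores a constant prediction equal to the mean rating of the same ten rows. That baseline is only a rough sanity check, because the mean is taken from the rows being scored. The `rmse_of` helper is hypothetical.

```python
import math

# (Rating, prediction) pairs copied from the predictions table
rows = [
    (4.4, 4.601987358601456),
    (4.9, 4.583464660973526),
    (4.2, 4.574203312159561),
    (4.5, 4.551049940124649),
    (4.8, 4.523265893682754),
    (4.6, 4.504743196054824),
    (4.3, 4.476959149612929),
    (4.5, 4.449175103171035),
    (4.7, 4.40286835910121),
    (4.6, 4.3982376846942275),
]

def rmse_of(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

model_rmse = rmse_of(rows)

# Baseline: always predict the mean rating of these ten rows
mean_rating = sum(actual for actual, _ in rows) / len(rows)
baseline_rmse = rmse_of([(actual, mean_rating) for actual, _ in rows])

print("RMSE on the ten displayed rows:", model_rmse)
print("RMSE of the constant-mean baseline:", baseline_rmse)
```

If the model's RMSE is not clearly below the baseline's, the feature is contributing little to the prediction.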


Save Model

In [29]:
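# overwrite() allows this cell to be re-run; a plain save() raises an error if the path already exists
model.write().overwrite().save("/mnt/data/tourist_prediction_model")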

In [30]:
print("Predictive analysis using PySpark ML completed successfully!")

Predictive analysis using PySpark ML completed successfully!
