In [None]:
'''Q1: MovieLens 1M Dataset. GroupLens Research provides a number of collections of movie ratings data
    collected from users of MovieLens in the late 1990s and early 2000s. The data provide movie ratings, 
    movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification,
    and occupation). Such data is often of interest in the development of recommendation systems based on machine learning
    algorithms. While we do not explore machine learning techniques in detail in this book, I will show you how to
    slice and dice datasets like these into the exact form you need. The MovieLens 1M dataset contains 1 million 
    ratings collected from 6,000 users on 4,000 movies. It’s spread across three tables: ratings, user information,
    and movie information. After extracting the data from the ZIP file, we can load each table into a pandas DataFrame
    object using pandas.read_table and perform the following tasks.
    
1.    Identify null values in the given dataset.

2.    Identify the types of attributes in the dataset.

3.    Plot a box plot and a violin plot. (Also state the inference for each attribute and find the outliers in it.)

4.    Plot histograms and identify overlapping. (Also state the inference for each attribute.)

5.    Draw different types of scatter plots (using the seaborn library).

6.    Perform univariate and multivariate analysis.'''

import pandas as pd

# Load ratings, users and movies data into separate pandas DataFrame objects
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')
users = pd.read_table('users.dat', sep='::', header=None, names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], engine='python')
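# movies.dat in the original ML-1M archive is Latin-1 encoded, so state the encoding explicitly
movies = pd.read_table('movies.dat', sep='::', header=None, names=['MovieID', 'Title', 'Genres'], engine='python', encoding='latin-1')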

# Identify null values in each DataFrame
print(ratings.isnull().sum())
print(users.isnull().sum())
print(movies.isnull().sum())



'''Types of Attributes:
The MovieLens 1M dataset contains the following types of attributes:

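UserID: integer (identifier)
MovieID: integer (identifier)
Rating: integer (ordinal, 1-5)
Timestamp: integer (seconds since the Unix epoch)
Gender: categorical (M or F)
Age: integer code for an age bracket (ordinal)
Occupation: categorical (codes 0-20)
Zip-code: string
Title: string
Genres: categorical (pipe-separated list of genres)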
'''
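
# The attribute types listed above can be checked against what pandas infers.
# A minimal sketch, assuming the column names used in this notebook.
# summarize_attribute_types is a hypothetical helper, and toy_users holds
# made-up rows for illustration, not real MovieLens records.

```python
import pandas as pd


def summarize_attribute_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its pandas dtype and distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(),
    })


# Made-up rows in the users.dat layout (illustration only)
toy_users = pd.DataFrame({
    'UserID': [1, 2, 3, 4],
    'Gender': ['F', 'M', 'M', 'F'],
    'Age': [25, 35, 25, 18],
    'Occupation': [4, 7, 7, 12],
    'Zip-code': ['00001', '00002', '00003', '00004'],
})

# Gender and Occupation are codes, so store them as pandas categoricals
toy_users['Gender'] = toy_users['Gender'].astype('category')
toy_users['Occupation'] = toy_users['Occupation'].astype('category')

type_summary = summarize_attribute_types(toy_users)
print(type_summary)
```

# On the real data the same helper could be applied to ratings, users and movies in turn.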


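import matplotlib.pyplot as plt
import seaborn as sns

# Join each rating to its user's demographics once and reuse it for both plots
ratings_users = pd.merge(ratings, users, on='UserID')

# Box plot of rating by gender
plt.figure()
sns.boxplot(x='Gender', y='Rating', data=ratings_users)

# Violin plot of rating by age group (separate figure so the plots do not overlap)
plt.figure()
sns.violinplot(x='Age', y='Rating', data=ratings_users)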
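
# Task 3 also asks for the outliers in each attribute. One common way to list
# them is the 1.5 * IQR rule, which is also the default whisker rule in a
# seaborn box plot. A minimal sketch: iqr_outliers is a hypothetical helper and
# toy_ratings is made-up data, not real MovieLens ratings.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up ratings on the 1-5 scale (illustration only)
toy_ratings = pd.Series([4, 4, 5, 3, 4, 5, 4, 1, 4, 3], name='Rating')

outliers = iqr_outliers(toy_ratings)
print(outliers)
```

# On the real data this could be called as iqr_outliers(ratings['Rating']).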


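import matplotlib.pyplot as plt

# Histograms of age and rating, each on its own figure
plt.figure()
users['Age'].hist(bins=20)
plt.figure()
ratings['Rating'].hist(bins=5)

# Overlapping histograms of rating by gender.
# Gender is stored in users, so join it onto ratings first.
ratings_users = pd.merge(ratings, users, on='UserID')
plt.figure()
ratings_users.loc[ratings_users['Gender'] == 'M', 'Rating'].hist(bins=5, alpha=0.5)
ratings_users.loc[ratings_users['Gender'] == 'F', 'Rating'].hist(bins=5, alpha=0.5)
plt.legend(['Male', 'Female'])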



# Scatter plot of Rating vs Timestamp (ratings is already loaded above)
plt.figure()
sns.scatterplot(x='Timestamp', y='Rating', data=ratings)



# Univariate analysis: summary statistics of Rating
print(ratings['Rating'].describe())

# Histogram of Rating (discrete=True draws one bar per integer rating)
plt.figure()
sns.histplot(x='Rating', data=ratings, discrete=True)

# Box plot of Rating
plt.figure()
sns.boxplot(x='Rating', data=ratings)

# Bivariate analysis: scatter plot of Rating vs Timestamp
plt.figure()
sns.scatterplot(x='Timestamp', y='Rating', data=ratings)

# Pearson correlation between Rating and Timestamp
print(ratings['Rating'].corr(ratings['Timestamp']))
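
# For the multivariate part of task 6, one option is to compare the mean rating
# across two demographic attributes at once. A minimal sketch, assuming the
# column names used in this notebook; mean_rating_table is a hypothetical helper
# and the toy frames hold made-up rows, not real MovieLens records.

```python
import pandas as pd


def mean_rating_table(ratings: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    """Mean rating for every (Age, Gender) combination present in the data."""
    merged = ratings.merge(users, on='UserID')
    return merged.pivot_table(values='Rating', index='Age',
                              columns='Gender', aggfunc='mean')


# Made-up rows (illustration only)
toy_ratings = pd.DataFrame({
    'UserID': [1, 1, 2, 2, 3],
    'MovieID': [10, 11, 10, 12, 11],
    'Rating': [5, 3, 4, 2, 1],
})
toy_users = pd.DataFrame({
    'UserID': [1, 2, 3],
    'Gender': ['F', 'M', 'M'],
    'Age': [25, 25, 35],
})

table = mean_rating_table(toy_ratings, toy_users)
print(table)
```

# Cells with no ratings for a given age and gender come out as NaN.
# The result can be passed to sns.heatmap for a visual comparison.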


In [None]:
'''Q2: Diabetes dataset (5 marks)

 Data Exploration: This includes inspecting the data, visualizing the data, and cleaning the data. Some of the steps used are as follows:

1. Viewing the data statistics.

2. Finding out the dimensions of the dataset, the variable names, the data types, etc.

3. Checking for null values.

4. Inspecting the target variable using pie plot and count plot.

5. Finding out the correlation among different features using heatmap and the bivariate relation between each pair of features using pair plot.

Model Training: 5 Classification Algorithms have been used to find out the best one. These are Logistic Regression, Support Vector Machine, Random Forest, K-Nearest Neighbours, and Naive Bayes.

In each of the algorithms, the steps followed are as follows:

1. Importing the library for the algorithm.

2. Creating an instance of the Classifier (with default values of parameters or by specifying certain values in certain cases).

3. Training the model on the train set.

4. Prediction on the test set using the trained model.

5. Calculating the accuracy of the prediction.
'''

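import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the notebook never loads df; 'diabetes.csv' is a placeholder path
df = pd.read_csv('diabetes.csv')

print(df.describe()) # summary statistics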

print(df.shape) # dimensions of the dataset
print(df.columns) # variable names
df.info() # data types (info() prints directly and returns None)

print(df.isnull().sum()) # number of null values in each variable

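# Sort by class label so the pie labels line up (assumes Outcome 0 = non-diabetic, 1 = diabetic)
outcome_counts = df['Outcome'].value_counts().sort_index()
print(outcome_counts) # distribution of the target variable
plt.figure()
plt.pie(outcome_counts, labels=['Non-diabetic', 'Diabetic'], autopct='%1.1f%%') # pie plot
plt.figure()
sns.countplot(x='Outcome', data=df) # count plot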

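# Correlation heatmap on its own figure so it does not draw over the previous plot
corr = df.corr()
plt.figure()
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt='.2f')

# Pairwise relations between features, coloured by the target
sns.pairplot(df, hue='Outcome')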
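
# The model-training steps listed in the Q2 description could be sketched with
# scikit-learn as follows. This is one possible arrangement, not the assignment's
# reference solution. Assumptions: an 80/20 stratified split, GaussianNB as the
# Naive Bayes variant, and feature scaling for the scale-sensitive models.
# compare_classifiers is a hypothetical helper, and the demo uses synthetic data
# from make_classification as a stand-in for the diabetes table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_classifiers(features, target, random_state=42):
    """Train five classifiers on one split and return their test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2,
        random_state=random_state, stratify=target)

    # Step 2: create an instance of each classifier
    models = {
        'Logistic Regression': make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)),
        'Support Vector Machine': make_pipeline(StandardScaler(), SVC()),
        'Random Forest': RandomForestClassifier(random_state=random_state),
        'K-Nearest Neighbours': make_pipeline(
            StandardScaler(), KNeighborsClassifier()),
        'Naive Bayes': GaussianNB(),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)                        # step 3: train
        predictions = model.predict(X_test)                # step 4: predict
        scores[name] = accuracy_score(y_test, predictions) # step 5: accuracy
    return scores


# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_scores = compare_classifiers(X_demo, y_demo)
for name, accuracy in sorted(demo_scores.items(), key=lambda item: -item[1]):
    print(f'{name}: {accuracy:.3f}')
```

# With the real data this would be called as
# compare_classifiers(df.drop(columns='Outcome'), df['Outcome']).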