In [None]:
'''Q1: MovieLens 1M Dataset
    GroupLens Research provides a number of collections of movie ratings data
    collected from users of MovieLens in the late 1990s and early 2000s. The data provide movie ratings, 
    movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification,
    and occupation). Such data is often of interest in the development of recommendation systems based on machine learning
    algorithms. While we do not explore machine learning techniques in detail in this book, I will show you how to
    slice and dice datasets like these into the exact form you need. The MovieLens 1M dataset contains 1 million 
    ratings collected from 6,000 users on 4,000 movies. It’s spread across three tables: ratings, user information,
    and movie information. After extracting the data from the ZIP file, we can load each table into a pandas DataFrame
    object using pandas.read_table and perform the following tasks.
    
1.    Identify null values in the given dataset.

2.    Identify the types of attributes in the dataset.

3.    Draw a box plot and a violin plot (state the inference for each attribute and find the outliers in the attribute).

4.    Draw histograms and identify overlapping (state the inference for each attribute).

5.    Draw different types of scatter plots (using the seaborn library).

6.    Perform univariate and multivariate analysis.
_________________________________________________ '''


#1
import pandas as pd

# Load ratings, users and movies data into separate pandas DataFrame objects
# ('::' is a multi-character separator, so the python parsing engine is required)
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')
users = pd.read_table('users.dat', sep='::', header=None, names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], engine='python')
# movies.dat is not UTF-8 encoded; latin-1 avoids a UnicodeDecodeError on some titles
movies = pd.read_table('movies.dat', sep='::', header=None, names=['MovieID', 'Title', 'Genres'], engine='python', encoding='latin-1')

# Identify null values in each DataFrame
print(ratings.isnull().sum())
print(users.isnull().sum())
print(movies.isnull().sum())


'''2
Types of Attributes:
The MovieLens 1M dataset contains the following types of attributes:

UserID: integer (identifier)
MovieID: integer (identifier)
Rating: integer (ordinal, 1-5)
Timestamp: integer (seconds since the Unix epoch)
Gender: categorical (M or F)
Age: integer (coded age group)
Occupation: categorical (integer codes 0-20)
Zip-code: string
Title: string
Genres: categorical (pipe-separated list of genres)
'''
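
# The attribute types listed above can also be checked programmatically.
# A minimal sketch: attribute_types and its max_levels threshold are
# hypothetical helpers (not part of pandas), and sample_users is a tiny
# made-up frame that only mimics the columns of users.dat.

```python
import pandas as pd

def attribute_types(df, max_levels=25):
    """Summarize each column: pandas dtype, distinct values, and a rough kind."""
    rows = []
    for col in df.columns:
        n_unique = df[col].nunique()
        if pd.api.types.is_numeric_dtype(df[col]):
            kind = 'numeric'
        elif n_unique <= max_levels:
            kind = 'categorical'
        else:
            kind = 'text'
        rows.append({'column': col, 'dtype': str(df[col].dtype),
                     'n_unique': n_unique, 'kind': kind})
    return pd.DataFrame(rows).set_index('column')

# Made-up rows for illustration only (not taken from the real dataset)
sample_users = pd.DataFrame({
    'UserID': [1, 2, 3, 4],
    'Gender': ['F', 'M', 'M', 'F'],
    'Age': [1, 56, 25, 45],
    'Occupation': [10, 16, 15, 7],
    'Zip-code': ['48067', '70072', '55117', '02460'],
})
summary = attribute_types(sample_users)
print(summary)
```

# Integer-coded categories such as Occupation are reported as numeric here,
# so the final classification still needs knowledge of the dataset.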

#3
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of ratings (column names match those given to read_table)
ratings.boxplot(column=['Rating'])
plt.show()

# Violin plot of age (pandas DataFrames have no violinplot method, so use seaborn)
sns.violinplot(y=users['Age'])
plt.show()

# Identify outliers in the Rating column with the 1.5 * IQR rule
q1 = ratings['Rating'].quantile(0.25)
q3 = ratings['Rating'].quantile(0.75)
iqr = q3 - q1
outliers = ratings[(ratings['Rating'] < q1 - 1.5*iqr) | (ratings['Rating'] > q3 + 1.5*iqr)]
print(outliers)
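
# The same 1.5 * IQR rule can be wrapped in a function and reused for any
# numeric attribute. A minimal sketch: iqr_outliers is a hypothetical helper,
# and sample_ratings is made-up data, not drawn from ratings.dat.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (series < lower) | (series > upper)
    return series[mask], (lower, upper)

# Made-up ratings for illustration only
sample_ratings = pd.Series([4, 4, 5, 3, 4, 5, 4, 1, 4, 3], name='Rating')
outlier_values, bounds = iqr_outliers(sample_ratings)
print(bounds)
print(outlier_values)
```

# With the real data this would be called as iqr_outliers(ratings['Rating'])
# or iqr_outliers(users['Age']).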


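#4
import seaborn as sns

# Histogram of ages
users.hist(column=['Age'])
plt.show()

# KDE plot of ratings for males and females
# (Gender is stored in users, so join it onto ratings first)
ratings_users = ratings.merge(users, on='UserID')
sns.kdeplot(ratings_users.loc[ratings_users['Gender'] == 'M', 'Rating'], label='Male')
sns.kdeplot(ratings_users.loc[ratings_users['Gender'] == 'F', 'Rating'], label='Female')
plt.xlabel('Rating')
plt.legend()  # without this the Male/Female labels are not drawn
plt.show()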
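
# Overlap between two distributions can also be expressed as a number rather
# than judged by eye. A minimal sketch of one possible measure, the overlap
# coefficient (sum over bins of the smaller of the two bin proportions).
# histogram_overlap is a hypothetical helper and the two arrays are made-up
# ratings, not values from the dataset.

```python
import numpy as np

def histogram_overlap(a, b, bins):
    """Overlap coefficient of two samples binned on shared edges (0 = disjoint, 1 = identical)."""
    counts_a, _ = np.histogram(a, bins=bins)
    counts_b, _ = np.histogram(b, bins=bins)
    p_a = counts_a / counts_a.sum()  # proportions, so sample sizes may differ
    p_b = counts_b / counts_b.sum()
    return float(np.minimum(p_a, p_b).sum())

# Made-up ratings for illustration only
male = np.array([3, 4, 4, 5, 2, 4, 3, 5])
female = np.array([4, 4, 5, 3, 5, 4])
bins = np.arange(0.5, 6.0, 1.0)  # one bin per star rating: edges 0.5, 1.5, ..., 5.5
overlap = histogram_overlap(male, female, bins)
print(round(overlap, 3))
```

# A value close to 1 means the two histograms largely coincide.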


#5
import seaborn as sns

# Scatter plot of ratings vs. timestamps
sns.scatterplot(x='Timestamp', y='Rating', data=ratings)
plt.show()

# Scatter plot of age vs. ratings (Age is stored in users, so merge on UserID)
sns.scatterplot(x='Age', y='Rating', data=ratings.merge(users, on='UserID'))
plt.show()

# Scatter plot of rating vs. genre (one row per movie-genre pair after explode)
genre_ratings = ratings.merge(movies, on='MovieID')[['Genres', 'Rating']]
genre_ratings = genre_ratings.assign(genre=genre_ratings['Genres'].str.split('|')).explode('genre', ignore_index=True)
sns.scatterplot(x='genre', y='Rating', data=genre_ratings)
plt.xticks(rotation=90)
plt.show()


#6
import seaborn as sns

# Univariate analysis of ratings
print(ratings['Rating'].describe())

# Multivariate analysis of ratings, age, and gender
sns.boxplot(x='Rating', y='Age', hue='Gender', data=ratings.merge(users, on='UserID'))
plt.show()

# Multivariate analysis of ratings and genres
genre_ratings = ratings.merge(movies, on='MovieID')[['Genres', 'Rating']]
genre_ratings = genre_ratings.assign(genre=genre_ratings['Genres'].str.split('|')).explode('genre', ignore_index=True)
sns.boxplot(x='genre', y='Rating', data=genre_ratings)
plt.xticks(rotation=90)
plt.show()


In [None]:
'''Q2: Diabetes dataset (5 marks)

 Data Exploration: This includes inspecting the data, visualizing the data, and cleaning the data. Some of the steps used are as follows:

1. Viewing the data statistics.

2. Finding out the dimensions of the dataset, the variable names, the data types, etc.

3. Checking for null values.

4. Inspecting the target variable using pie plot and count plot.

5. Finding out the correlation among different features using heatmap and the bivariate relation between each pair of features using pair plot.

Model Training: Five classification algorithms are compared to find the best one: Logistic Regression, Support Vector Machine, Random Forest, K-Nearest Neighbours, and Naive Bayes.

In each of the algorithms, the steps followed are as follows:

1. Importing the library for the algorithm.

2. Creating an instance of the Classifier (with default values of parameters or by specifying certain values in certain cases).

3. Training the model on the train set.

4. Prediction on the test set using the trained model.

5. Calculating the accuracy of the prediction.
'''
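import matplotlib.pyplot as plt
import seaborn as sns

# NOTE: df is assumed to already hold the diabetes dataset as a pandas
# DataFrame with a binary 'Outcome' column; it is not loaded in this cell.
print(df.describe())  # summary statistics

print(df.shape)  # dimensions of the dataset
print(df.columns)  # variable names
df.info()  # data types (info() prints its report itself and returns None)

print(df.isnull().sum())  # number of null values in each variable

counts = df['Outcome'].value_counts().sort_index()  # fixed order: 0 first, then 1
print(counts)  # distribution of the target variable
plt.pie(counts, labels=['Non-diabetic', 'Diabetic'], autopct='%1.1f%%')  # pie plot, assuming 0 = non-diabetic
plt.show()
sns.countplot(x='Outcome', data=df)  # count plot
plt.show()

corr = df.corr()  # pairwise correlation between features
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt='.2f')
plt.show()

sns.pairplot(df, hue='Outcome')  # bivariate relation between each pair of features
plt.show()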
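
# The model-training steps described above (import, instantiate, fit, predict,
# accuracy) could look roughly like this. A minimal sketch using scikit-learn:
# the synthetic data from make_classification is only a stand-in so the cell
# runs on its own, and the feature scaling, the 80/20 split and the
# random_state values are assumptions, not part of the original task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; with the real dataset one would instead use
# X = df.drop(columns='Outcome') and y = df['Outcome'].
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 2: one instance per classifier, mostly with default parameters.
# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'Support Vector Machine': make_pipeline(StandardScaler(), SVC()),
    'Random Forest': RandomForestClassifier(random_state=0),
    'K-Nearest Neighbours': make_pipeline(StandardScaler(), KNeighborsClassifier()),
    'Naive Bayes': GaussianNB(),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)                         # step 3: train on the train set
    y_pred = model.predict(X_test)                      # step 4: predict on the test set
    accuracies[name] = accuracy_score(y_test, y_pred)   # step 5: accuracy
    print(f'{name}: {accuracies[name]:.3f}')

best = max(accuracies, key=accuracies.get)
print('Best model on this split:', best)
```

# Accuracy on a single split is a rough comparison; the ranking can change
# with a different random_state or with the real data.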