# <center>Programming for Data Analysis</center>

## Project Objective:

In this project we create a dataset by simulating a real-world phenomenon of our choice, then model and synthesise that data using Python's `numpy.random` package.


## Project Scope:

1. The dataset should contain at least one hundred data points across at least four different variables.

2. Investigate the types of variables involved, their likely distributions, and their relationships with each other.

3. Synthesise/simulate a data set as closely matching their properties as possible.

4. Detail your research and implement the simulation in a Jupyter notebook.

### Let's get started !!!

## 1. Importing the required libraries

In [1]:
# Importing required libraries.
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
from matplotlib import pyplot as plt #visualisation
import warnings
warnings.filterwarnings('ignore')

## 2. Selection of Dataset and Attributes

For this project I will create a song track dataset that could support real problems such as music recommendation, predicting a song's rating, or ranking the top 10 songs of the year. The dataset will have 6 columns and 200 rows. The column attributes are as follows:

Song_id = int
&nbsp; #Unique ID for every song in the dataset, in total there are 200 songs in the dataset

Listen_count = int
&nbsp; #Number of times a song was listened to

Year = int
&nbsp; [2015, 2019]
&nbsp; #Release year of the song track

Genre = str
&nbsp; [pop, rock, jazz, hiphop, disco, folk]

Downloads = int
&nbsp; #Number of downloads of each song

Rating = float
&nbsp; #Average rating of each song


## 3. Generating Dataset using Numpy Random Module

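We will use a normal distribution for listen counts and number of downloads, with the mean depending on the genre, and a uniform distribution for ratings, with the range depending on the release year (3.1 to 5 for songs from 2015 to 2017, 1 to 3 otherwise). The year and genre are drawn at random using `np.random.randint` and `np.random.choice`.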
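One detail worth checking before generating the years is that `np.random.randint` treats its `high` argument as exclusive. The short sketch below illustrates this with a separate generator (`demo_rng` and its seed are illustrative assumptions, not part of the project code), so it does not disturb the seeded global stream used in the next cell.

```python
import numpy as np

# Illustrative only: a local legacy generator with an arbitrary seed (assumption)
demo_rng = np.random.RandomState(0)

# Same bounds as the simulation: low is inclusive, high is exclusive
demo_years = demo_rng.randint(low=2015, high=2020, size=10_000)

# The largest value that can be drawn is high - 1, i.e. 2019
print(demo_years.min(), demo_years.max())
```

To include 2020 as a possible release year, `high` would need to be 2021.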

In [2]:
# total number of rows:
rows = 200

# lists for the dataset columns:
song_id = []
listen_counts = []
year = []
genre = []
no_downloads = []
rating = []

np.random.seed(22)

for i in range(rows):
    song_id.append(i+1)
    
    genre.append(np.random.choice(["pop", "rock", "jazz", "hiphop", "disco", "folk"]))
    
    if genre[i] in ["rock", "jazz", "disco"]:
        listen_counts.append(int(np.random.normal(loc=2500, scale=625)))
        
        no_downloads.append(int(np.random.normal(loc=5000, scale=1250)))
    
    else:
        listen_counts.append(int(np.random.normal(loc=1250, scale=312.5)))
        
        no_downloads.append(int(np.random.normal(loc=2500, scale=625))) 
    
    # randint excludes `high`, so years fall in 2015..2019
    year.append(np.random.randint(low=2015, high=2020))
    
    if year[i] in [2015, 2016, 2017]:
        rating.append(round(np.random.uniform(3.1,5), 1))
    else:
        rating.append(round(np.random.uniform(1,3), 1))
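The same rules can also be expressed without a Python loop. The following is a minimal sketch, assuming NumPy's newer `Generator` API (`np.random.default_rng`) and a hypothetical helper named `simulate_songs`. It draws from a different random stream than `np.random.seed(22)`, so it will not reproduce the exact values shown in the tables of this notebook.

```python
import numpy as np

def simulate_songs(rows=200, seed=22):
    """Hypothetical vectorised version of the loop-based simulation."""
    rng = np.random.default_rng(seed)  # assumption: Generator API, not the legacy global seed

    genre = rng.choice(["pop", "rock", "jazz", "hiphop", "disco", "folk"], size=rows)
    high_mean = np.isin(genre, ["rock", "jazz", "disco"])

    # The mean depends on the genre group; the standard deviation is a quarter
    # of the mean, which matches the loc/scale pairs used in the loop.
    listen_mean = np.where(high_mean, 2500.0, 1250.0)
    download_mean = np.where(high_mean, 5000.0, 2500.0)
    listen_counts = rng.normal(loc=listen_mean, scale=listen_mean / 4).astype(int)
    no_downloads = rng.normal(loc=download_mean, scale=download_mean / 4).astype(int)

    # `high` is exclusive, so years fall in 2015..2019
    year = rng.integers(low=2015, high=2020, size=rows)

    # Draw both rating ranges for every row, then pick one per row by year
    early = year <= 2017
    rating = np.where(early,
                      rng.uniform(3.1, 5, size=rows),
                      rng.uniform(1, 3, size=rows)).round(1)

    return {
        "Song_Id": np.arange(1, rows + 1),
        "Genre": genre,
        "Listen_Counts": listen_counts,
        "Year": year,
        "No_of_Downloads": no_downloads,
        "Ratings": rating,
    }

songs = simulate_songs()
print({name: values[:3] for name, values in songs.items()})
```

The returned dictionary can be passed straight to `pd.DataFrame(...)`. The loop version is easier to read step by step, while this version scales better to larger row counts.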

## 4. Creating a DataFrame for a Better View of the Dataset

In [3]:
# column name list
col_names = ['Song_Id', 'Genre', 'Listen_Counts', 'Year', 'No_of_Downloads', 'Ratings']

# build the dataframe directly from the generated lists,
# keeping the column order given in col_names
df = pd.DataFrame({
    'Song_Id':         song_id,
    'Genre':           genre,
    'Listen_Counts':   listen_counts,
    'Year':            year,
    'No_of_Downloads': no_downloads,
    'Ratings':         rating,
}, columns=col_names)
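Before displaying the data, it can help to confirm that the assembled frame meets the project scope (at least 100 rows and at least four variables) and contains no missing values. The sketch below assumes a hypothetical helper `check_scope` and uses a stand-in frame `demo` with the same shape as `df`; in the notebook itself the helper would be called on `df`.

```python
import pandas as pd

def check_scope(frame, min_rows=100, min_cols=4):
    """Hypothetical helper: True if the frame is large enough and has no missing values."""
    n_rows, n_cols = frame.shape
    has_missing = frame.isna().any().any()
    return n_rows >= min_rows and n_cols >= min_cols and not has_missing

# Stand-in data with the notebook's column names (200 rows, 6 columns)
demo_cols = ['Song_Id', 'Genre', 'Listen_Counts', 'Year', 'No_of_Downloads', 'Ratings']
demo = pd.DataFrame({name: range(200) for name in demo_cols})

print(check_scope(demo))          # full-size frame
print(check_scope(demo.head(5)))  # too few rows
```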

## 5. Displaying Dataset:

In [4]:
# show the first 5 rows of the dataframe
df.head(5)

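| | Song_Id | Genre | Listen_Counts | Year | No_of_Downloads | Ratings |
|---|---|---|---|---|---|---|
| 0 | 1 | folk | 1578 | 2018 | 3076 | 1.3 |
| 1 | 2 | disco | 3074 | 2015 | 3620 | 4.6 |
| 2 | 3 | jazz | 3320 | 2018 | 5320 | 1.9 |
| 3 | 4 | folk | 919 | 2018 | 3159 | 2.4 |
| 4 | 5 | jazz | 1740 | 2019 | 7414 | 2.4 |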


In [5]:
# last 5 rows
df.tail()

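| | Song_Id | Genre | Listen_Counts | Year | No_of_Downloads | Ratings |
|---|---|---|---|---|---|---|
| 195 | 196 | hiphop | 1735 | 2019 | 1456 | 2.8 |
| 196 | 197 | jazz | 1863 | 2016 | 6877 | 4.9 |
| 197 | 198 | pop | 1402 | 2015 | 2653 | 3.7 |
| 198 | 199 | jazz | 2099 | 2015 | 6095 | 4.6 |
| 199 | 200 | pop | 1455 | 2016 | 2747 | 3.9 |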
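The displayed rows can be used for a quick check that the intended relationships are visible: higher listen counts and downloads for rock, jazz and disco, and higher ratings for songs released from 2015 to 2017. The sketch below rebuilds only the ten rows shown above into a frame named `sample` (the helper columns `High_Mean_Genre` and `Early_Year` are assumptions added for illustration). Ten rows are far too few to draw conclusions, so in the notebook the same `groupby` calls would be run on the full `df`.

```python
import pandas as pd

# The ten rows displayed by df.head() and df.tail()
sample = pd.DataFrame({
    "Genre": ["folk", "disco", "jazz", "folk", "jazz",
              "hiphop", "jazz", "pop", "jazz", "pop"],
    "Listen_Counts": [1578, 3074, 3320, 919, 1740, 1735, 1863, 1402, 2099, 1455],
    "Year": [2018, 2015, 2018, 2018, 2019, 2019, 2016, 2015, 2015, 2016],
    "No_of_Downloads": [3076, 3620, 5320, 3159, 7414, 1456, 6877, 2653, 6095, 2747],
    "Ratings": [1.3, 4.6, 1.9, 2.4, 2.4, 2.8, 4.9, 3.7, 4.6, 3.9],
})

# Flag the two groups that the simulation treats differently
sample["High_Mean_Genre"] = sample["Genre"].isin(["rock", "jazz", "disco"])
sample["Early_Year"] = sample["Year"] <= 2017

# Average listens and downloads per genre group
by_genre = sample.groupby("High_Mean_Genre")[["Listen_Counts", "No_of_Downloads"]].mean()

# Rating range per year group
by_year = sample.groupby("Early_Year")["Ratings"].agg(["min", "max"])

print(by_genre)
print(by_year)
```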
