
I have a dataset titled 'music.csv'. Let's start by loading the dataset and checking what it contains. This will give me a better understanding of the data that I am dealing with and will help me decide what kind of machine learning model I can use for prediction.

In [2]:
import pandas as pd
from google.colab import drive

In [4]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
data=pd.read_csv('/content/drive/MyDrive/ML_TTU /music.csv')

In [7]:
data.head()

   age  gender   genre
0   20       1  HipHop
1   23       1  HipHop
2   25       1  HipHop
3   26       1    Jazz
4   29       1    Jazz


The dataset consists of three columns:

* 'age': Presumably representing the age of a person.
* 'gender': A categorical variable likely denoting the gender of a person (typically, 1 might represent males and 0 females, or vice versa).
* 'genre': The music genre that a person of a particular age and gender prefers.

This is a typical classification problem where, given age and gender, I need to predict the preferred music genre.

Let's move on and do a bit of exploratory data analysis to understand the data better. I will look at the distribution of age, gender, and genre. I will also check for any missing values in the data.


In [8]:
missing_values = data.isnull().sum()

In [9]:
age_distribution = data['age'].describe()
gender_distribution = data['gender'].value_counts()
genre_distribution = data['genre'].value_counts()

missing_values, age_distribution, gender_distribution, genre_distribution

(age       0
 gender    0
 genre     0
 dtype: int64,
 count    18.000000
 mean     27.944444
 std       5.127460
 min      20.000000
 25%      25.000000
 50%      28.000000
 75%      31.000000
 max      37.000000
 Name: age, dtype: float64,
 1    9
 0    9
 Name: gender, dtype: int64,
 Classical    6
 HipHop       3
 Jazz         3
 Dance        3
 Acoustic     3
 Name: genre, dtype: int64)

Here are some insights from the data:

* Missing Values: There are no missing values in the dataset. This is good, as it means I don't need to impute or delete anything.
* Age Distribution: The age of individuals in the dataset ranges from 20 to 37 years, with an average age of around 28 years.
* Gender Distribution: The dataset is balanced in terms of gender, with equal instances of both categories (presumably male and female).
* Genre Distribution: The dataset contains five genres of music. 'Classical' has the most instances, while 'HipHop', 'Jazz', 'Dance', and 'Acoustic' are equally represented.

Given this information, I can move on to the next step of preparing the data for model building. I will need to encode the 'genre' column, which is stored as text, because machine learning models generally work with numerical data.
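
The genre counts above also give a reference point for judging any model later on. The short sketch below recomputes each genre's share from those counts (typed in by hand from the output, not read from the file) and the accuracy a model would get by always guessing the most common genre. The names `genre_counts`, `genre_share` and `baseline_accuracy` are mine, chosen for illustration.

```python
import pandas as pd

# Genre counts copied from the value_counts() output above.
genre_counts = pd.Series(
    {"Classical": 6, "HipHop": 3, "Jazz": 3, "Dance": 3, "Acoustic": 3}
)

# Share of each genre in the 18-row dataset.
genre_share = genre_counts / genre_counts.sum()

# Accuracy of a model that always predicts the most common genre.
baseline_accuracy = genre_share.max()

print(genre_share.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Always guessing 'Classical' would be right about a third of the time, so a useful model should do clearly better than that.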



I'll be practicing with machine learning tools: building, training, and testing a model, and then using it to make predictions.

Let's proceed with the next steps:

* Encode the 'genre' column to convert it from categorical to numerical data.
* Split the dataset into a training set and a test set.
* Build and train a machine learning model using the training set.
* Evaluate the model's performance using the test set.
* Use the model to predict new cases.

Let's start with the first step:

In [10]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [11]:
le = LabelEncoder()

In [12]:
data_encoded = data.copy()

In [13]:
data_encoded['genre'] = le.fit_transform(data_encoded['genre'])

In [14]:
X = data_encoded[['age', 'gender']]
y = data_encoded['genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

data_encoded.head(), X_train.head(), y_train.head()

(   age  gender  genre
 0   20       1      3
 1   23       1      3
 2   25       1      3
 3   26       1      4
 4   29       1      4,
     age  gender
 3    26       1
 13   27       0
 16   34       0
 15   31       0
 11   25       0,
 3     4
 13    0
 16    1
 15    1
 11    2
 Name: genre, dtype: int64)

I've successfully encoded the 'genre' column into numerical values, and split the dataset into training and test sets.
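
To see which integer stands for which genre, it helps to look at the encoder's `classes_` attribute. The sketch below fits a separate `LabelEncoder` on the five genre names from the dataset and builds a lookup dictionary. The names `encoder` and `label_mapping` are illustrative and not part of the notebook above.

```python
from sklearn.preprocessing import LabelEncoder

# The five genre names that appear in the value_counts() output.
genres = ["HipHop", "Jazz", "Classical", "Dance", "Acoustic"]

encoder = LabelEncoder()
encoder.fit(genres)

# classes_ holds the labels in sorted order; a label's position
# in that array is the integer code assigned to it.
label_mapping = {str(name): code for code, name in enumerate(encoder.classes_)}

print(label_mapping)
# {'Acoustic': 0, 'Classical': 1, 'Dance': 2, 'HipHop': 3, 'Jazz': 4}
```

This matches the encoded table above, where 'HipHop' became 3 and 'Jazz' became 4. The codes follow alphabetical order and carry no meaning beyond identifying the class.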


In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [16]:
model = DecisionTreeClassifier(random_state=42)

In [17]:
model.fit(X_train, y_train)

In [18]:
y_pred = model.predict(X_test)

In [19]:
accuracy = accuracy_score(y_test, y_pred)

accuracy

1.0

The model's accuracy on the test set is 1.0, which means it predicted every test instance correctly. This looks like a great result, but it should be read with caution. With only 18 rows and a 20% split, the test set holds just a handful of instances, so a perfect score may simply reflect an easy split rather than a model that generalizes well. A perfect test score is not by itself a sign of overfitting. Overfitting is a common problem in machine learning where the model performs exceptionally well on the training data but fails to generalize to unseen data, so it would show up as a gap between training and test accuracy.
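
A quick bit of arithmetic shows how coarse an accuracy figure is on a test set this small. The sketch below assumes, as I understand scikit-learn's behaviour, that `train_test_split` rounds the test count up when `test_size` is a fraction. The variable names are illustrative.

```python
import math

n_rows = 18       # row count reported by describe() above
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up for a fractional test_size.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# With n_test instances, accuracy can take only n_test + 1 distinct values.
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]

print(n_train, n_test)        # 14 4
print(possible_accuracies)    # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Each test instance moves the score by 25 percentage points, so one accuracy number from one split says little about how the model would do on new people.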

To further validate the performance of the model, I could collect or generate more data and test the model on it.
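
Another option that needs no new data is cross-validation, which rotates the held-out part so that every row is used for testing once. The following is a minimal sketch, not part of the original notebook. It runs on a small stand-in table: the first five rows match the `data.head()` output, and the remaining rows are assumed for illustration so that the block runs without 'music.csv'. The names `toy`, `toy_X`, `toy_y`, `cv` and `scores` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in table. Only the first five rows come from data.head();
# the rest are ASSUMED values, not read from music.csv.
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
              + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})
toy_X = toy[["age", "gender"]]
toy_y = toy["genre"]

# Three folds, because the rarest genres have only three rows each.
# Stratification keeps one row of each rare genre in every test fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), toy_X, toy_y, cv=cv
)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

On the real data, `toy_X` and `toy_y` would be replaced with `X` and `y` from the notebook. The spread of the per-fold scores gives a rough sense of how much the result depends on the particular split.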

However, as I only have this dataset at the moment, let's proceed to use this model to make predictions on new cases. For example, I can predict the preferred music genre of a 21-year-old male and a 30-year-old female.

Remember that I used a label encoder to encode the genres into numerical values. I'll need to decode the predictions to get the actual genre names. Let's do that now.

In [20]:
new_cases = pd.DataFrame({
    'age': [21, 30],
    'gender': [1, 0]  # Assuming 1 is male and 0 is female
})

In [21]:
new_predictions = model.predict(new_cases)


In [22]:
decoded_predictions = le.inverse_transform(new_predictions)

decoded_predictions

array(['HipHop', 'Acoustic'], dtype=object)

The model predicts that a 21-year-old male would prefer 'HipHop' and a 30-year-old female would prefer 'Acoustic' music.

Please note that these predictions are based on the patterns the model has learned from the given dataset. The accuracy of these predictions in real-world scenarios would depend on how well the dataset represents the actual population.

To conclude, I have successfully practiced using machine learning tools to load and preprocess data, build and train a decision tree model, and use this model to make predictions.