<a href="https://colab.research.google.com/github/MattiaTarantino/Machine-Learning-project/blob/main/Fondamenti_di_IA_homework_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### For the homeworks we are going to use the "[Online News Popularity Data Set](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity#)"

The dataset can be used both for regression and classification tasks.

#### Source:

* Kelwin Fernandes, INESC TEC, Porto, Portugal / Universidade do Porto, Portugal
* Pedro Vinagre, ALGORITMI Research Centre, Universidade do Minho, Portugal
* Paulo Cortez, ALGORITMI Research Centre, Universidade do Minho, Portugal
* Pedro Sernadela, Universidade de Aveiro

#### Data Set Information:

* The articles were published by Mashable (www.mashable.com), and the content and the rights to reproduce it belong to them. Hence, this dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
* Acquisition date: January 8, 2015
* The estimated relative performance values were computed by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

#### Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)


The first two columns (url and timedelta) are non-predictive and should be ignored.

The last column **shares** contains the value to predict.

### Regression
In the case of regression, we want to predict the value of the **shares** column.

### Classification
In the case of classification we want to predict one of two classes:

* *low* -- shares < 1,400
* *high* -- shares >= 1,400

### Metrics

#### Regression
To evaluate how well we are doing on the **regression** task, we will use the Root Mean Squared Error (RMSE). RMSE is given by

$$
\sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\Big(d_i -f_i\Big)^2}}
$$


where:

* $n$ is the number of test samples
* $d_i$ is the ground truth value of the i-th sample
* $f_i$ is the predicted value of the i-th sample
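
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original homework code: the helper name `rmse` and the toy values are assumptions chosen only to show the arithmetic.

```python
import numpy as np

def rmse(d, f):
    """Root Mean Squared Error between ground truth d and predictions f."""
    d = np.asarray(d, dtype=float)
    f = np.asarray(f, dtype=float)
    # mean of the squared errors, then square root
    return float(np.sqrt(np.mean((d - f) ** 2)))

# Toy values (hypothetical, not taken from the dataset)
d_true = [1000, 2000, 3000]
f_pred = [1100, 1900, 3200]

# errors are -100, 100, -200, so the mean squared error is 20000
print(rmse(d_true, f_pred))  # about 141.42
```

Because the errors are squared before averaging, a few articles with very large share counts can dominate the score.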


#### Classification
To evaluate how well we are doing on the **classification** task, we will use the accuracy metric. Accuracy is given by

$$
\frac{TP+TN}{TP+TN+FP+FN}
$$

where:

* TP is the number of positive samples *correctly* classified as positive
* TN is the number of negative samples *correctly* classified as negative
* FP is the number of negative samples *incorrectly* classified as positive
* FN is the number of positive samples *incorrectly* classified as negative
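
As a rough sketch, the four counts and the resulting accuracy can be computed directly from label arrays. The helper names `confusion_counts` and `accuracy` are hypothetical, the labels are toy data, and treating *high* as the positive class is an assumption made only for this example.

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive="high"):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Toy labels (hypothetical)
y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low", "low", "high", "low"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 1 2 1 1
print(accuracy(tp, tn, fp, fn))  # 0.6
```

Accuracy is symmetric in the two classes, so choosing *low* as the positive class instead would swap TP with TN and FP with FN but leave the score unchanged.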

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip

In [None]:
!unzip OnlineNewsPopularity.zip

In [None]:
import pandas as pd
import numpy as np

Strip the whitespace from the column names and remove the first two (non-predictive) columns

In [None]:
df = pd.read_csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.iloc[: , 2:]
# Add a 'class' column for classification: 'high' if shares >= 1400, otherwise 'low'
df['class'] = np.where(df['shares'] >= 1400, 'high', 'low')

In [None]:
df


#  Classification 




In [None]:
class Node():
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None): 
        # for decision node
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain
        
        # for leaf node
        self.value = value

In [None]:
class DecisionTreeClassifier():
      def __init__(self, min_samples_split=2, max_depth=3):
        
        # initialize the root of the tree 
        self.root = None
        
        # stopping conditions
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth

      def build_tree(self, dataset, curr_depth=0):
        
        x, y = dataset.iloc[: , :58], dataset['class']
        # rows are samples, columns are features
        num_samples, num_features = len(x), len(x.columns)
        
        # split until stopping conditions are met
        if num_samples>=self.min_samples_split and curr_depth<=self.max_depth:
            # find the best split
            best_split = self.get_best_split(dataset, num_samples, num_features)
            # check that a valid split with positive information gain was found
            if best_split.get("info_gain", 0)>0:
                # recur left
                left_subtree = self.build_tree(best_split["dataset_left"], curr_depth+1)
                # recur right
                right_subtree = self.build_tree(best_split["dataset_right"], curr_depth+1)
                # return decision node
                return Node(best_split["feature_index"], best_split["threshold"], 
                            left_subtree, right_subtree, best_split["info_gain"])
        
        # compute leaf node
        leaf_value = self.calculate_leaf_value(y)
        # return leaf node
        return Node(value=leaf_value)


      # Compute the entropy (in bits) of a set of class labels
      def entropy(self, y):
          if len(y) == 0:
              return 0
          # class proportions; only classes that actually occur are returned,
          # so a pure node never evaluates log2(0)
          p = y.value_counts(normalize=True)
          return -(p*np.log2(p)).sum()

      # Split the data on a feature at the given threshold
      def split(self, df, feature_index, threshold):
          df_left = df[df.iloc[:,feature_index] <= threshold]
          df_right = df[df.iloc[:,feature_index] > threshold]
          return df_left, df_right

      # Compute the information gain of a split
      def information_gain(self, parent, l_child, r_child, mode="entropy"):
          weight_l = len(l_child) / len(parent)
          weight_r = len(r_child) / len(parent)
          #if mode=="gini":
              #gain = gini_index(parent) - (weight_l*gini_index(l_child) + weight_r*gini_index(r_child))
          #else:
          gain = self.entropy(parent) - (weight_l*self.entropy(l_child) + weight_r*self.entropy(r_child))
          return gain

      def get_best_split(self, dataset, num_samples, num_features):
          # default to zero gain so that the caller creates a leaf when no valid split exists
          best_split = {"info_gain": 0}
          max_info_gain = -float("inf")
          # the first 58 columns are the predictive features
          for feature_index in range(58):
              # use the data that reached this node, not the full dataframe
              feature_values = dataset.iloc[:,feature_index]
              median = feature_values.median()
              df_left, df_right = self.split(dataset, feature_index, median)
              if len(df_left)>0 and len(df_right)>0:
                  y, left_y, right_y = dataset['class'], df_left['class'], df_right['class']
                  curr_info_gain = self.information_gain(y, left_y, right_y)
                  if curr_info_gain>max_info_gain:
                      best_split["feature_index"] = feature_index
                      best_split["threshold"] = median
                      best_split["dataset_left"] = df_left
                      best_split["dataset_right"] = df_right
                      best_split["info_gain"] = curr_info_gain
                      max_info_gain = curr_info_gain

          return best_split

      def calculate_leaf_value(self, Y):
        Y = Y.tolist()
        return max(Y, key=Y.count)

      def fit(self, dataset):
        self.root = self.build_tree(dataset)

      def predict(self, X):
        # Predict the class of every row of X
        predictions = []
        for i in range(len(X)):
          predictions.append(self.make_prediction(X.iloc[i,:], self.root))
        return predictions

      def make_prediction(self, x, tree):
        # Predict a single data point by walking the tree down to a leaf
        if tree.value is not None:
           return tree.value
        # feature_index is a column position, so use positional access
        feature_val = x.iloc[tree.feature_index]
        if feature_val <= tree.threshold:
            return self.make_prediction(x, tree.left)
        else:
            return self.make_prediction(x, tree.right)
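
The entropy and information-gain computations that drive the split search can be illustrated in isolation. The following is a minimal standalone sketch, assuming a median threshold on a single feature as in the class above; the function names and the toy DataFrame are hypothetical and not part of the homework code.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    if len(labels) == 0:
        return 0.0
    # proportions of the classes that actually occur
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    w_l = len(left) / len(parent)
    w_r = len(right) / len(parent)
    return entropy(parent) - (w_l * entropy(left) + w_r * entropy(right))

# Toy data (hypothetical): the feature separates the two classes perfectly
toy = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0],
    "class": ["low", "low", "high", "high"],
})

threshold = toy["feature"].median()  # 2.5
left = toy[toy["feature"] <= threshold]["class"]
right = toy[toy["feature"] > threshold]["class"]

print(entropy(toy["class"]))                         # 1.0
print(information_gain(toy["class"], left, right))   # 1.0
```

A balanced two-class node has an entropy of 1 bit, and a split that leaves both children pure recovers all of it, which is the largest gain a binary problem allows.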



In [None]:
from sklearn.model_selection import train_test_split
x, y = df.iloc[: , :58], df['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)


In [None]:
classifier = DecisionTreeClassifier(min_samples_split=3, max_depth=3)
# fit on the training rows only, so that the test set stays unseen
classifier.fit(df.loc[x_train.index])

In [None]:
y_pred = classifier.predict(x_test) 
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)*100)

Accuracy: 59.74271660991298


In [None]:
#Evaluation using Confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,y_pred)

array([[2519, 1726],
       [1466, 2218]])
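
As a consistency check, the accuracy can be recovered from a confusion matrix by dividing its diagonal by the total count. The sketch below copies the matrix printed above; scikit-learn places true labels on the rows and predicted labels on the columns, with the classes in sorted order (`'high'`, then `'low'`).

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: true class, columns: predicted class, order: ['high', 'low']
cm = np.array([[2519, 1726],
               [1466, 2218]])

correct = np.trace(cm)  # samples on the diagonal were classified correctly
total = cm.sum()        # all test samples

print(correct, total)         # 4737 7929
print(correct / total * 100)  # about 59.74
```

The result agrees with the accuracy value printed in the previous cell.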



#  Sklearn testing - Classification




In [None]:
# Samples and labels for classification
x, y = df.iloc[: , :58], df['class']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create Decision Tree classifier object
model = DecisionTreeClassifier(criterion="entropy", max_depth = 1)

# Train Decision Tree classifier
model = model.fit(x_train, y_train)

#Predict the response for test dataset
y_pred = model.predict(x_test)

In [None]:
#Evaluation using Accuracy score
from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred)*100)

Accuracy: 59.78055240257283


In [None]:
#Evaluation using Confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,y_pred)

array([[2600, 1645],
       [1605, 2079]])