## Data collection

### Extract each type of news

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By 
from selenium.webdriver.common.keys import Keys 
import time

In [36]:
driver = webdriver.Chrome()
driver.get("https://news.google.com/home?hl=en-IN&gl=IN&ceid=IN:en")
buttons =  driver.find_elements(By.CLASS_NAME,"EctEBd")

heading = ""
data = []

for button in buttons:
    # get_attribute returns None when the attribute is missing
    data_n_ini = button.get_attribute("data-n-ini")

    try:
        # Tabs 8 to 13 are the topic sections (Business through Health)
        if data_n_ini and data_n_ini.isdigit() and 8 <= int(data_n_ini) <= 13:
            a = button.find_element(By.TAG_NAME, 'a')
            a.click()
            print("\n",a.text)
            heading = a.text
            time.sleep(5)  # give the section's headlines time to load

            # Every non-empty element with class "gPFEn" is one headline
            labels = driver.find_elements(By.CLASS_NAME,"gPFEn")
            for lab in labels:
                if lab.text != "":
                    data.append({'label':heading,'content':lab.text})

    except ValueError:
        continue
driver.quit()


 Business 


 Technology 


 Entertainment 


 Sports 


 Science 


 Health 



In [38]:
data[0]

{'label': 'Business',
 'content': 'Union Budget 2025: Electric vehicles to become affordable'}

### Write data into csv

In [39]:
import csv

In [40]:
csv_file = "newsdataset.csv"

with open(csv_file, mode = "w" , newline="", encoding = "utf-8") as file:
    fieldnames = ['label','content']

    writer = csv.DictWriter(file,fieldnames = fieldnames)

    writer.writeheader()

    writer.writerows(data)


print("done")

done
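
Before training, it can help to confirm what was actually written. The following is a minimal sketch, using only the standard library, that counts rows per label in a CSV with the same `label` and `content` columns. The helper name `label_counts` and the demo rows are made up for illustration.

```python
import csv
import os
import tempfile
from collections import Counter


def label_counts(path):
    """Count rows per label in a CSV that has a 'label' column."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["label"] for row in csv.DictReader(f))


# Demo on a throwaway file with made-up rows.
demo_rows = [
    {"label": "Business", "content": "demo headline one"},
    {"label": "Business", "content": "demo headline two"},
    {"label": "Health", "content": "demo headline three"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["label", "content"])
        writer.writeheader()
        writer.writerows(demo_rows)
    counts = label_counts(demo_path)

print(counts)
```

With the real data, the call would be `label_counts("newsdataset.csv")`; a heavily skewed count would suggest that one section was scraped more than the others.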


## Training

In [48]:
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [50]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score


In [51]:
df = pd.read_csv('newsdataset.csv')
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
label_encoder = LabelEncoder()
df['label_encoded'] = label_encoder.fit_transform(df['label'])
df.head(5)

| | label | content | label_encoded |
|---|---|---|---|
| 0 | Business | Maruti Suzuki's vehicle dispatches from factor... | 0 |
| 1 | Business | Non-tax revenue from telecom pegged 33 pc lowe... | 0 |
| 2 | Health | AI model predicts dengue outbreaks two months ... | 2 |
| 3 | Health | Handful of California Almonds a Day: Natural A... | 2 |
| 4 | Entertainment | Sankranthiki Vasthunnam team plans a grand fin... | 1 |


In [47]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
tokenizer = Tokenizer(num_words = 100,oov_token="<OOV>")

In [None]:
To build a model that classifies the type of a news item from its content, using the `label` and `content` columns, we'll follow these general steps:

1. **Load and preprocess the data** (text and labels).
2. **Split the data** into training and testing sets.
3. **Vectorize the text data** (convert news content to numerical features).
4. **Train a classifier** using the vectorized features.
5. **Evaluate the model** on test data.

Let's go through each of these steps in Python, assuming you have a DataFrame `df` with two columns: `'label'` (the type of news) and `'content'` (the text of the news article).

### Step-by-Step Code:

1. **Load and preprocess the data:**

   We'll load the data and encode the labels as integers. Lowercasing and tokenizing the text is handled later by the vectorizer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Load your dataframe
df = pd.read_csv('newsdataset.csv')  # The CSV written in the data collection step

# Example of data structure:
# df = pd.DataFrame({'label': ['Sports', 'Politics', 'Technology', ...],
#                    'content': ['Content of news article 1', 'Content of news article 2', ...]})

# Encode the labels into numeric format
label_encoder = LabelEncoder()
df['label_encoded'] = label_encoder.fit_transform(df['label'])

# View the first few rows to check the data
print(df.head())
```
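
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `Business` maps to 0 and `Health` to 2 in the `df.head(5)` output of this notebook. A plain-Python sketch of the same mapping, using made-up labels in place of `df['label']`:

```python
# Hypothetical labels standing in for df['label'].
labels = ["Health", "Business", "Sports", "Business", "Entertainment"]

# Sorted unique class names; the position of each name is its integer code.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]  # maps codes back to names

print(classes)
print(encoded)
```

With the real encoder, `label_encoder.inverse_transform(...)` performs the decoding step, which is useful for turning predicted codes back into readable labels.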

2. **Split the data into training and test sets:**

We will split the data into 80% training and 20% testing.

```python
# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['label_encoded'], test_size=0.2, random_state=42)

# View the shapes of the splits to confirm
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
```

3. **Vectorize the text data:**

We will use TF-IDF (Term Frequency-Inverse Document Frequency) to transform the text content into numerical features.

```python
# Initialize a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fit and transform the training data, then transform the test data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Check the shape of the resulting feature matrices
print(f"Training data shape: {X_train_tfidf.shape}")
print(f"Test data shape: {X_test_tfidf.shape}")
```

4. **Train a classifier:**

We will use the **Naive Bayes** classifier, which works well with text classification problems and TF-IDF features.

```python
# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the model on the training data
nb_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test data
y_pred = nb_classifier.predict(X_test_tfidf)

# Evaluate the model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))
```
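
Once the model is trained, the usual next step is to classify headlines it has not seen. The sketch below chains the vectorizer and the classifier with `make_pipeline` so that raw text goes in and labels come out. The training headlines and the two new headlines are made up for illustration; with the real data, `X_train` and the original string labels would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training headlines (assumption: stand-ins for the scraped data).
train_texts = [
    "stock market shares rise after budget",
    "bank profit and shares fall",
    "cricket team wins the final match",
    "football team loses the match",
]
train_labels = ["Business", "Business", "Sports", "Sports"]

# Vectorizer and classifier chained into a single estimator.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Words missing from the training vocabulary are ignored at prediction time.
new_headlines = [
    "shares rise as bank reports profit",
    "cricket match ends in a win for the team",
]
predictions = model.predict(new_headlines)
print(list(predictions))
```

Keeping both steps in one pipeline object also avoids accidentally fitting the vectorizer on test data.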

### Explanation of the Code:

- **Label Encoding**: The `LabelEncoder` converts categorical labels (like "Sports", "Politics", etc.) into numeric values.
- **TF-IDF Vectorization**: `TfidfVectorizer` converts each text into a numeric vector in which a word's weight grows with how often it appears in that document and shrinks with the number of documents in the dataset that contain it.
- **Multinomial Naive Bayes**: This classifier is a common baseline for text classification. It assumes that word occurrences are conditionally independent given the class, which matches the "bag of words" representation. It is designed for word counts, but fractional TF-IDF weights also work in practice.
- **Model Evaluation**: We report **accuracy** from `accuracy_score` and per-class **precision**, **recall**, and **F1-score** from `classification_report`.
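
To make the TF-IDF bullet concrete, here is a minimal plain-Python sketch of the weighting on three made-up documents. It follows the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. Tokenization and stop-word removal are left out, and the names `doc_freq` and `tfidf` are chosen for this example only.

```python
import math
from collections import Counter

# Three made-up, already-tokenized documents.
docs = [
    ["budget", "tax", "budget"],
    ["match", "team"],
    ["budget", "team"],
]
n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

In the first document, `budget` occurs twice and ends up with the larger weight, even though `tax` has the higher inverse document frequency because it appears in only one document.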

### Example Output:

The output has the following shape. The labels and numbers below are illustrative, not results from the news dataset above:

```
Accuracy: 0.85
Classification Report:
               precision    recall  f1-score   support

       Politics       0.80      0.90      0.85       100
        Sports       0.90      0.80      0.85       100
   Technology       0.85      0.85      0.85       100

    accuracy                           0.85       300
   macro avg       0.85      0.85      0.85       300
weighted avg       0.85      0.85      0.85       300
```

### Hyperparameter Tuning (Optional):

If the model’s performance isn’t satisfactory, you could:
- Experiment with different vectorization methods (e.g., using **Word2Vec** or **BERT embeddings**).
- Tune the hyperparameters of the Naive Bayes model or try other models (e.g., **Logistic Regression**, **SVM**, or **Random Forests**).
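
As one concrete instance of tuning the Naive Bayes model, the sketch below searches over the smoothing parameter `alpha` with `GridSearchCV`. The headlines, the grid values, and `cv=2` are placeholders chosen so that the example runs on eight rows; real data would warrant more folds and a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up headlines, four per class, so 2-fold stratified CV is possible.
texts = [
    "shares rise after budget",
    "bank profit falls",
    "market shares fall",
    "budget lifts bank shares",
    "team wins cricket match",
    "football team loses match",
    "cricket final today",
    "team reaches football final",
]
labels = ["Business"] * 4 + ["Sports"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Candidate values for the Naive Bayes smoothing parameter alpha.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

The `nb__alpha` key follows the pipeline convention of step name, two underscores, then parameter name, so the same grid could also include vectorizer settings such as `tfidf__max_features`.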

For more advanced techniques, you can use **deep learning** models like **LSTM**, **CNN**, or **BERT**, which can perform better when enough labeled data is available but require more computational resources.
