<a href="https://colab.research.google.com/github/ShehzadDev/py/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping and Data Manipulation with NLTK

In this Colab notebook, we'll explore how to perform web scraping in Python, using the `requests` library to fetch web content and `BeautifulSoup` to parse HTML. We'll also use the `pandas` library to manipulate and organize the extracted data in DataFrames.



In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import copy

# Coursera

In [2]:
# def scrape_coursera(topic):
#     url = f'https://www.coursera.org/courses?query={topic}'
#     response = requests.get(url)
#     soup = BeautifulSoup(response.content, 'html.parser')

#     # Extract relevant information from the Coursera page
#     course_titles = soup.find_all('div', class_='course-title')
#     course_descriptions = soup.find_all('div', class_='course-description')
#     course_levels = soup.find_all('div', class_='course-level')
#     course_prices = soup.find_all('div', class_='course-price')

#     # Create a DataFrame with the extracted data
#     coursera_data = pd.DataFrame({
#         'Course Title': course_titles,
#         'Course Description': course_descriptions
#     })

#     # Convert the coursera_data variable to a string value
#     coursera_data['Course Title'] = coursera_data['Course Title'].to_string()
#     coursera_data['Course Description'] = coursera_data['Course Description'].to_string()

#     return coursera_data

# # Call the function to scrape Coursera data
# coursera_data = scrape_coursera('computer science')
# print(coursera_data)

# Udemy

In [3]:
# def scrape_udemy():
#     url = 'https://www.udemy.com/courses/search/?q=computer%20science'
#     response = requests.get(url)
#     soup = BeautifulSoup(response.content, 'html.parser')

#     # Extract relevant information from the Udemy page
#     course_titles = soup.find_all('h2', class_='course-title')
#     course_descriptions = soup.find_all('div', class_='course-description')
#     course_prices = soup.find_all('div', class_='course-price')

#     # Create a DataFrame with the extracted data
#     udemy_data = pd.DataFrame({
#         'Course Title': course_titles,
#         'Course Description': course_descriptions
#     })

#     return udemy_data

# # Call the function to scrape Udemy data
# udemy_data = scrape_udemy()

# print(udemy_data)

# OCW

In [4]:
# import os
# import fitz  # PyMuPDF
# import pandas as pd

# def extract_title_from_text(text):
#     lines = text.split('\n')
#     title = lines[0].strip() if lines else 'Unknown Title'
#     return title

# def scrape_ocw(file_paths):
#     # Create lists to store data
#     topic_titles = []
#     topic_contents = []

#     for file_path in file_paths:
#         # Extract relevant information from the PDF file
#         with fitz.open(file_path) as pdf_document:
#             # Assuming each page in the PDF is a separate topic
#             for page_num in range(pdf_document.page_count):
#                 page = pdf_document[page_num]
#                 text = page.get_text()
#                 title = extract_title_from_text(text)
#                 topic_titles.append(title)
#                 topic_contents.append(text)

#     # Create a DataFrame with the extracted data
#     ocw_data = pd.DataFrame({
#         'Topic Title': topic_titles,
#         'Topic Content': topic_contents
#     })

#     return ocw_data

# # Example usage with a list of locally downloaded PDF file paths
# pdf_file_paths = [
#     '/content/2c91bd942816c0cca14f216f098c0bf4_MIT6_857S14_Lec01.pdf',
#     '/content/49ff5c8ba85258a7e9da7e6463687420_MIT6_857S14_Lec03.pdf',
#     '/content/af613d5bbf100539a9ac8c05ecfd1b64_MIT6_857S14_Lec05.pdf',
#     '/content/7fe96705a82149e2de1d5f20ca34595b_MIT6_857S14_Lec02.pdf',
#     '/content/a526654fdcc273259d1e2bc5b4989d4d_MIT6_857S14_Lec04.pdf',
#     '/content/b5053bc3f3c29468545043e2666b0764_MIT6_857S14_Lec10.pdf',
#     '/content/b5053bc3f3c29468545043e2666b0764_MIT6_857S14_Lec10.pdf'
# ]

# ocw_data = scrape_ocw(pdf_file_paths)

# # Display the DataFrame with 'Topic Title' and 'Topic Content'
# print(ocw_data)

# # Save the DataFrame to a CSV file
# ocw_data.to_csv("ocw.csv", index=False)


# Wikipedia Topic Scraping


1. **Fetching Page Titles:**
   - The script constructs Wikipedia API URLs to search for each topic's page titles.
   - Requests are made to the API, and the response data is converted to JSON format.
   - Relevant information (page titles) is extracted from the search results.

2. **Fetching Page Content:**
   - Using the obtained page titles, new Wikipedia API URLs are constructed to fetch page content.
   - Requests are made to the API, and the response data is converted to JSON format.
   - Relevant information (page content) is extracted from the content results.

3. **Creating DataFrame:**
   - The extracted page titles, along with the corresponding content, are organized into a pandas DataFrame.
   - The DataFrame includes columns for 'Topic ID', 'Topic Name', and 'Topic Content Description'.




In [5]:
def scrape_wikipedia(topics):

    # Create a list of Wikipedia API URLs for each topic to get page titles
    titles_urls = [f'https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srsearch={topic}' for topic in topics]

    # Make requests to the Wikipedia API to get page titles
    titles_responses = [requests.get(url) for url in titles_urls]

    # Convert the response data to JSON format
    titles_data = [response.json() for response in titles_responses]

    # Extract relevant information from the Wikipedia search results
    page_titles = [result['title'] for d in titles_data for result in d['query']['search']]

    # Create a list of Wikipedia API URLs for each topic to get page content
    content_urls = [f'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles={title}&exintro&explaintext' for title in page_titles]

    # Make requests to the Wikipedia API to get page content
    content_responses = [requests.get(url) for url in content_urls]
    # print(len(content_urls))

    # Convert the response data to JSON format
    content_data = [response.json() for response in content_responses]

    # Extract relevant information from the Wikipedia content results
    page_contents = [next(iter(d['query']['pages'].values()))['extract'] for d in content_data]
    # Create a DataFrame with the extracted data
    wikipedia_data = pd.DataFrame({'Topic Title': page_titles, 'Topic Content': page_contents,'Urls':content_urls})

    return wikipedia_data

# Example usage
topics = ['Computer Networks', 'Web Development', 'Machine Learning', 'Natural Language processing',
          'OOP', 'Programming Fundamentals', 'DSA', 'Object Oriented Analysis and Design', 'DBMS']

wikipedia_data = scrape_wikipedia(topics)

# Display the DataFrame with titles and content
print(wikipedia_data)


                     Topic Title  \
0               Computer network   
1     Port (computer networking)   
2               Wireless network   
3    Computer Networks (journal)   
4               Computer science   
..                           ...   
85                    Rel (DBMS)   
86  Isolation (database systems)   
87               Object database   
88               Oracle Database   
89    Open Database Connectivity   

                                        Topic Content  \
0   A computer network is a set of computers shari...   
1   In computer networking, a port or port number ...   
2   A wireless network is a computer network that ...   
3   Computer Networks is a scientific journal of c...   
4   Computer science is the study of computation, ...   
..                                                ...   
85  Rel is an open-source true relational database...   
86  In database systems, isolation determines how ...   
87  An object database or object-oriented database... 

In [6]:
wikipedia_data.columns

Index(['Topic Title', 'Topic Content', 'Urls'], dtype='object')

In [7]:
# Concatenate all the dataframes into a single dataframe
# corpus = pd.concat([ ocw_data , wikipedia_data], ignore_index=True)
corpus=copy.deepcopy(wikipedia_data)

# Display the combined dataframe
print(corpus)

                     Topic Title  \
0               Computer network   
1     Port (computer networking)   
2               Wireless network   
3    Computer Networks (journal)   
4               Computer science   
..                           ...   
85                    Rel (DBMS)   
86  Isolation (database systems)   
87               Object database   
88               Oracle Database   
89    Open Database Connectivity   

                                        Topic Content  \
0   A computer network is a set of computers shari...   
1   In computer networking, a port or port number ...   
2   A wireless network is a computer network that ...   
3   Computer Networks is a scientific journal of c...   
4   Computer science is the study of computation, ...   
..                                                ...   
85  Rel is an open-source true relational database...   
86  In database systems, isolation determines how ...   
87  An object database or object-oriented database... 

# Text Data Cleaning

1. **Remove Extra Whitespaces:**
   - Utilizes the `split` and `join` functions to eliminate additional whitespaces, ensuring text is well-formatted.

2. **Remove Special Symbols and Non-Alphanumeric Characters:**
   - Applies a regular expression (`re.sub`) to keep only alphanumeric characters and spaces, discarding special symbols.

3. **Remove and Replace Broken Lines:**
   - Uses a regular expression to remove newline characters (`\n`) and replaces them with a space to maintain continuity.

4. **Convert to Lowercase:**
   - Transforms the entire text to lowercase to achieve uniformity in the dataset.
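The steps above can be tried on a single string. This is a sketch with a slightly different order: newlines are replaced first and whitespace is collapsed last, so that deleting a symbol between two spaces doesn't leave a double space behind. The name `clean_text_sketch` is an assumption, chosen to keep it apart from the notebook's own `clean_text`.

```python
import re

def clean_text_sketch(text):
    # Replace line breaks with spaces so words on adjacent lines stay separated
    text = text.replace('\n', ' ')
    # Keep only letters, digits and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Lowercase for uniformity
    text = text.lower()
    # Collapse any run of whitespace into a single space
    return ' '.join(text.split())

print(clean_text_sketch("Open-source  DBMS:\nfast & small!"))
# opensource dbms fast small
```

Note that removing the hyphen merges "Open-source" into "opensource", the same effect visible in the cleaned output further down.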



In [8]:
import re

# Function to clean text data
def clean_text(text):
    # Remove extra whitespaces
    text = ' '.join(text.split())

    # Remove special symbols and non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Remove and replace broken lines
    text = re.sub(r'\n', ' ', text)

    # Convert to lowercase
    text = text.lower()

    return text

# Clean the 'Topic Content' column of the `corpus` DataFrame
corpus['Topic Content'] = corpus['Topic Content'].apply(clean_text)

# Display the cleaned DataFrame
print(corpus['Topic Content'])
print(corpus.columns)


0     a computer network is a set of computers shari...
1     in computer networking a port or port number i...
2     a wireless network is a computer network that ...
3     computer networks is a scientific journal of c...
4     computer science is the study of computation i...
                            ...                        
85    rel is an opensource true relational database ...
86    in database systems isolation determines how t...
87    an object database or objectoriented database ...
88    oracle database commonly referred to as oracle...
89    in computing open database connectivity odbc i...
Name: Topic Content, Length: 90, dtype: object
Index(['Topic Title', 'Topic Content', 'Urls'], dtype='object')


In [9]:
# Per-topic vocabulary size (distinct words) and total word count
vocab_size = []
word_count = []
for content in corpus['Topic Content']:
    words = content.split()  # split() without arguments ignores repeated spaces
    vocab_size.append(len(set(words)))
    word_count.append(len(words))
corpus['vocabulary size'], corpus['word count'] = vocab_size, word_count
corpus.to_csv('/content/corpus.csv')

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
corpus.columns

Index(['Topic Title', 'Topic Content', 'Urls', 'vocabulary size',
       'word count'],
      dtype='object')

# Text Tokenization with NLTK in Python

## Importing NLTK and Downloading Resources

The code begins by importing the necessary module for tokenization from NLTK (`word_tokenize`). Additionally, the `nltk.download('punkt')` command ensures that the required 'punkt' resource is downloaded. This resource contains pre-trained models for tokenization.
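As a sketch of a more defensive setup, the tokenizer can fall back to plain whitespace splitting when NLTK or its tokenizer data isn't available. The function name `tokenize_with_fallback` is an assumption, and whitespace splitting is only a rough stand-in: on text that has already been lowercased and stripped of punctuation the two approaches should mostly agree, but they are not equivalent in general.

```python
def tokenize_with_fallback(text):
    """Tokenize with NLTK's word_tokenize if possible, else split on whitespace."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed
        # LookupError: the tokenizer data has not been downloaded
        return text.split()

print(tokenize_with_fallback('a computer network is a set of computers'))
```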


In [12]:
tokens=copy.deepcopy(corpus[['Topic Title', 'Topic Content']])

# tokens.csv: topic, stems, lemmas

In [13]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Function to tokenize text data
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

# Tokenize the cleaned 'Topic Content' column of the `tokens` DataFrame
tokens['Tokens'] = tokens['Topic Content'].apply(tokenize_text)

# Display the tokenized DataFrame
print(tokens[['Topic Content', 'Tokens']])
print(tokens.columns)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


                                        Topic Content  \
0   a computer network is a set of computers shari...   
1   in computer networking a port or port number i...   
2   a wireless network is a computer network that ...   
3   computer networks is a scientific journal of c...   
4   computer science is the study of computation i...   
..                                                ...   
85  rel is an opensource true relational database ...   
86  in database systems isolation determines how t...   
87  an object database or objectoriented database ...   
88  oracle database commonly referred to as oracle...   
89  in computing open database connectivity odbc i...   

                                               Tokens  
0   [a, computer, network, is, a, set, of, compute...  
1   [in, computer, networking, a, port, or, port, ...  
2   [a, wireless, network, is, a, computer, networ...  
3   [computer, networks, is, a, scientific, journa...  
4   [computer, science, is, the, st

# Text Preprocessing with NLTK: Stemming and Lemmatization

## Importing NLTK and Downloading Resources

The code begins by importing the required modules from NLTK (`PorterStemmer`, `WordNetLemmatizer`, and `stopwords`). It also ensures that the necessary NLTK resources ('stopwords' and 'wordnet') are downloaded.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
```


In [14]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# Download NLTK resources if not already downloaded
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

# Function to perform stemming
def stem_text(tokens):
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Function to perform lemmatization
def lemmatize_text(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Remove stopwords once, then apply stemming and lemmatization to the filtered tokens
stop_words = set(stopwords.words('english'))
filtered = tokens['Tokens'].apply(lambda x: [token for token in x if token not in stop_words])

tokens['Stemmed Tokens'] = filtered.apply(stem_text)
tokens['Lemmatized Tokens'] = filtered.apply(lemmatize_text)
tokens['Unique Words'] = [list(set(content)) for content in tokens['Lemmatized Tokens']]

# Display the DataFrame with stems and lemmas
print(tokens[['Topic Title', 'Stemmed Tokens', 'Lemmatized Tokens']])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                     Topic Title  \
0               Computer network   
1     Port (computer networking)   
2               Wireless network   
3    Computer Networks (journal)   
4               Computer science   
..                           ...   
85                    Rel (DBMS)   
86  Isolation (database systems)   
87               Object database   
88               Oracle Database   
89    Open Database Connectivity   

                                       Stemmed Tokens  \
0   [comput, network, set, comput, share, resourc,...   
1   [comput, network, port, port, number, number, ...   
2   [wireless, network, comput, network, use, wire...   
3   [comput, network, scientif, journal, comput, t...   
4   [comput, scienc, studi, comput, inform, autom,...   
..                                                ...   
85  [rel, opensourc, true, relat, databas, manag, ...   
86  [databas, system, isol, determin, transact, in...   
87  [object, databas, objectori, databas, databas,... 

In [15]:
tokens[['Topic Title', 'Stemmed Tokens', 'Lemmatized Tokens']].to_csv('/content/tokens.csv')

# vocabulary.csv: topic, unique words

In [16]:
vocabulary = copy.deepcopy(tokens[['Topic Title', 'Unique Words']])

In [17]:
vocabulary.to_csv('/content/vocabulary.csv')


# Part-of-Speech Tagging with spaCy

We use spaCy, a natural language processing library, to perform Part-of-Speech (POS) tagging on the tokenized words. POS tagging assigns a grammatical category (noun, verb, adjective, and so on) to each word in a text.

## Downloading spaCy Model

The code downloads the spaCy model 'en_core_web_sm' with the command `!python -m spacy download en_core_web_sm` each time the cell runs, then loads it with `spacy.load`.
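To skip the download when the model is already present, one option is to test whether the model package is importable first. The sketch below uses only the standard library; the helper names `is_installed` and `ensure_spacy_model` are assumptions, and it relies on spaCy models being installed as regular Python packages.

```python
import importlib
import importlib.util
import subprocess
import sys

def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

def ensure_spacy_model(name='en_core_web_sm'):
    """Download the spaCy model only when it is not importable yet."""
    if not is_installed(name):
        # Same command as the notebook's shell line, run with the current interpreter
        subprocess.run([sys.executable, '-m', 'spacy', 'download', name], check=True)
        # Let the import system notice the freshly installed package
        importlib.invalidate_caches()

print(is_installed('json'))  # a standard-library module, used here only as a demo
```

Calling `ensure_spacy_model()` before `spacy.load('en_core_web_sm')` would then download the model at most once per environment.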



# pos.csv: topic, data with pos tags

In [18]:
pos = copy.deepcopy(tokens[['Topic Title','Tokens']])

In [19]:
import spacy
!python -m spacy download en_core_web_sm

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Function to perform POS tagging
def pos_tagging(tokens):
    text = ' '.join(tokens)  # Join tokens into a single string
    doc = nlp(text)
    pos_tags = [(token.text, token.pos_) for token in doc]
    return pos_tags

# Apply POS tagging to the 'Tokens' column
pos['POS Tags'] = pos['Tokens'].apply(pos_tagging)

# Display the DataFrame with POS tags
print(pos.head())


2023-12-29 12:33:33.038351: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-29 12:33:33.038433: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-29 12:33:33.040199: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load

In [20]:
pos[['Topic Title','POS Tags']].to_csv('/content/pos.csv')


## readability.csv

# Calculating Readability Metrics for Text Content

Readability metrics are calculated for the text content using the `textstat` library. The `calculate_readability_metrics` function below applies a single formula, the Flesch-Kincaid grade level, and stores the result in the 'Readability Metrics' column.

## Readability Metrics

`textstat` offers several readability formulas for assessing the complexity and understandability of text, including:

- **Flesch Reading Ease:** A score indicating how easy or difficult the text is to understand. Higher scores correspond to easier readability.

- **Flesch-Kincaid Grade Level:** The grade level required to understand the text. It represents the number of years of education needed to comprehend the content.

- **SMOG Index:** A measure of text readability based on the number of polysyllabic words. It estimates the years of education a person needs to understand the text.

- **Coleman-Liau Index:** An index indicating the understandability of the text. It assesses the text's complexity based on characters per word and words per sentence.

- **Automated Readability Index:** A formula for assessing the understandability of a text. It provides a score that corresponds to the U.S. school grade level.

- **Dale-Chall Readability Score:** A readability measure that considers a list of familiar words. It estimates the ease of comprehension for readers with different education levels.
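For reference, the two Flesch formulas can be written directly in terms of word, sentence, and syllable counts. This is a sketch of the standard published formulas, not of how `textstat` counts words or syllables internally, and the counts in the example are made up.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate U.S. school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# 100 words with 150 syllables, counted as 5 sentences and then as 1 sentence
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
print(round(flesch_kincaid_grade(100, 1, 150), 2))  # 41.11
```

Both formulas depend on average sentence length, so text whose sentence punctuation has been stripped may be scored as one very long sentence, which raises the grade level considerably.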



In [21]:
readability=copy.deepcopy(corpus[['Topic Title','Topic Content']])

In [22]:
# Install the required library if not installed
!pip install textstat

import textstat

# Function to calculate the readability score (Flesch-Kincaid grade level)
def calculate_readability_metrics(text):
    return textstat.flesch_kincaid_grade(text)

# Apply the function to the 'Topic Content' column
readability['Readability Metrics'] = readability['Topic Content'].apply(calculate_readability_metrics)

# Display the DataFrame with readability metrics
readability.head()



Collecting textstat
  Downloading textstat-0.7.3-py3-none-any.whl (105 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.14.0-py3-none-any.whl (2.0 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.14.0 textstat-0.7.3


| | Topic Title | Topic Content | Readability Metrics |
|---|---|---|---|
| 0 | Computer network | a computer network is a set of computers shari... | 80.9 |
| 1 | Port (computer networking) | in computer networking a port or port number i... | 94.6 |
| 2 | Wireless network | a wireless network is a computer network that ... | 44.7 |
| 3 | Computer Networks (journal) | computer networks is a scientific journal of c... | 18.2 |
| 4 | Computer science | computer science is the study of computation i... | 110.6 |


In [23]:
readability.to_csv('/content/readability.csv')

### Read the text, assign each text its readability level, use readability as y, and use logistic regression to classify whether the text is easy or hard to read

In [24]:
Data= copy.deepcopy(readability[['Topic Content','Readability Metrics']])

In [25]:
Data.head()

| | Topic Content | Readability Metrics |
|---|---|---|
| 0 | a computer network is a set of computers shari... | 80.9 |
| 1 | in computer networking a port or port number i... | 94.6 |
| 2 | a wireless network is a computer network that ... | 44.7 |
| 3 | computer networks is a scientific journal of c... | 18.2 |
| 4 | computer science is the study of computation i... | 110.6 |


In [26]:
!pip install transformers sentence_transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... done
  Created wheel for sentence_transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=0b37d322a3fbf4cbe457da535ef031782eedc6d325650faec2d3da62ada2e699
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence_tr


In [27]:

Data['features']=Data['Topic Content'].apply(lambda x : model.encode(x))

In [28]:
Data.head()

| | Topic Content | Readability Metrics | features |
|---|---|---|---|
| 0 | a computer network is a set of computers shari... | 80.9 | [0.03143598, 0.006238364, -0.018474584, -0.061... |
| 1 | in computer networking a port or port number i... | 94.6 | [-0.002639845, -0.0065796883, -0.019606123, -0... |
| 2 | a wireless network is a computer network that ... | 44.7 | [-0.06350159, 0.03695857, -0.034751248, -0.026... |
| 3 | computer networks is a scientific journal of c... | 18.2 | [-0.032327004, 0.0051970826, -0.013547472, 0.0... |
| 4 | computer science is the study of computation i... | 110.6 | [-0.026379341, 0.05121022, -0.055364273, -0.04... |


In [29]:
import numpy as np

X=np.array(model.encode(Data['Topic Content']))
Data=Data[['features','Readability Metrics']]

In [30]:
X.shape

(90, 384)

In [31]:
!pip install scikit-learn



In [32]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

# Features X are the sentence embeddings computed above;
# the target is the readability score of each text
y = np.array(Data['Readability Metrics'])  # target variable

# Convert the target variable into a binary classification problem
# by splitting the scores at a threshold.
# NOTE: 0.5 is very low for a grade-level score, so most texts are likely
# to fall into class 1; pick a threshold that fits the score distribution.
threshold = 0.5
y_binary = (y > threshold).astype(int)

# Split data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)
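One way to get two classes of comparable size is to split at the median score rather than at a fixed constant. The sketch below is an illustration on the five scores displayed by `head()` earlier, not the notebook's method; with the full dataset the same idea would apply to all 90 scores.

```python
from statistics import median

scores = [80.9, 94.6, 44.7, 18.2, 110.6]  # the five scores shown by head() above

# Texts scoring above the median are labelled 1 (harder), the rest 0 (easier)
threshold = median(scores)
labels = [int(score > threshold) for score in scores]

print(threshold, labels)  # 80.9 [0, 1, 0, 0, 1]
```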


In [33]:
X_train

array([[-4.2068530e-02, -7.2173841e-02, -2.1158790e-02, ...,
         2.3613064e-02,  5.3143222e-02,  4.2855434e-02],
       [-4.1256122e-02, -8.0317326e-02,  4.9232827e-03, ...,
        -1.0070503e-01,  2.3657857e-02, -2.0852825e-02],
       [-3.2091722e-02,  6.7567393e-02, -8.7988839e-02, ...,
         3.3912759e-02,  3.1175395e-02, -1.7805958e-02],
       ...,
       [-2.5111824e-02,  4.1959327e-02, -7.7525191e-02, ...,
         8.5887253e-02, -8.7249913e-04,  2.6781622e-02],
       [-8.4943481e-02,  4.6718769e-02,  6.5144494e-02, ...,
         3.1426558e-03,  1.1434577e-01, -5.3322099e-02],
       [-1.5244385e-02,  6.5314947e-03, -2.4643060e-02, ...,
         1.0000869e-01,  1.4433688e-01,  1.1238927e-04]], dtype=float32)

In [34]:
# Initialize logistic regression model
# (named clf so it does not overwrite the SentenceTransformer stored in `model`)
clf = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence; adjust as needed

clf.fit(X_train, y_train)

# Make predictions on test data
y_pred = clf.predict(X_test)

# Calculate precision, recall, and F1-score
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the scores
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')

Precision: 0.79
Recall: 0.89
F1 Score: 0.84


  _warn_prf(average, modifier, msg_start, len(result))
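The printed scores, together with the `_warn_prf` line, are consistent with a classifier that predicts only one class. As a sketch, the pure-Python function below computes support-weighted precision, recall, and F1 (the same averaging idea as `average='weighted'`), and a hypothetical test split reproduces the printed numbers. The actual test labels are not shown in the notebook, so the 16/2 split is an assumption.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all labels in y_true."""
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        p_l = tp / predicted if predicted else 0.0  # undefined precision counted as 0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n  # each label weighted by its share of true samples
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1

# Hypothetical 18-sample test split: 16 positives, 2 negatives, all predicted positive
y_true = [1] * 16 + [0] * 2
y_pred = [1] * 18
print([round(v, 2) for v in weighted_scores(y_true, y_pred)])  # [0.79, 0.89, 0.84]
```

If this reading is right, the scores say more about class imbalance from the 0.5 threshold than about how well the embeddings predict readability.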
