# Kaholas Assignment
Instructions:
1. Visit huggingface.co and create an account.
2. Choose any pre-trained model for Natural Language Processing from the Hugging Face Models Hub.
3. Write code in Colab that uses the Hugging Face API to perform any 2 of the following:  
   a. Text classification  
   b. Sentiment analysis  
   c. Text generation  
   d. Question Answering  
4. Make sure you use code annotations and comments to make your code understandable.
5. Submit the Colab link with open-to-all permission.

Submission Guidelines:
1. The Colab notebook should be fully functional and able to run without errors.
2. The Colab notebook should include annotations and comments explaining the code.
3. The Colab notebook should have open-to-all permission to ensure proper grading.

Dataset:
https://www.kaggle.com/datasets/gpreda/bbc-news

**Let's Start our tasks step-by-step**
## Step 1: Installing & importing all the required libraries

In [1]:
!pip install requests
!pip install opendatasets
!pip install jovian



In [21]:
import requests
import pandas as pd
import opendatasets as od
import jovian

## Step 2: Load the dataset
* I have stored the dataset in a file named `bbcnews_dataset` on Drive.
* It could be read directly from my Google Drive; in Colab that requires importing `drive` from `google.colab` to mount the path.
* The cell below instead downloads the dataset from Kaggle with `opendatasets`.  
Let's see:

In [3]:
dataset=od.download('https://www.kaggle.com/datasets/gpreda/bbc-news')

Skipping, found downloaded files in "./bbc-news" (use force=True to force download)


In [4]:
df1=pd.read_csv('bbc-news/bbc_news.csv')

## Step 3: Let's understand the dataset
* Now let's understand our dataset.
* Try to define the tasks.
* If needed, make a separate DataFrame to achieve the target.


In [5]:
df1.head(10)

| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | Mon, 07 Mar 2022 08:01:56 GMT | https://www.bbc.co.uk/news/world-europe-60638042 | https://www.bbc.co.uk/news/world-europe-606380... | The Ukrainian president says the country will ... |
| 1 | War in Ukraine: Taking cover in a town under a... | Sun, 06 Mar 2022 22:49:58 GMT | https://www.bbc.co.uk/news/world-europe-60641873 | https://www.bbc.co.uk/news/world-europe-606418... | Jeremy Bowen was on the frontline in Irpin, as... |
| 2 | Ukraine war 'catastrophic for global food' | Mon, 07 Mar 2022 00:14:42 GMT | https://www.bbc.co.uk/news/business-60623941 | https://www.bbc.co.uk/news/business-60623941?a... | One of the world's biggest fertiliser firms sa... |
| 3 | Manchester Arena bombing: Saffie Roussos's par... | Mon, 07 Mar 2022 00:05:40 GMT | https://www.bbc.co.uk/news/uk-60579079 | https://www.bbc.co.uk/news/uk-60579079?at_medi... | The parents of the Manchester Arena bombing's ... |
| 4 | Ukraine conflict: Oil price soars to highest l... | Mon, 07 Mar 2022 08:15:53 GMT | https://www.bbc.co.uk/news/business-60642786 | https://www.bbc.co.uk/news/business-60642786?a... | Consumers are feeling the impact of higher ene... |
| 5 | Ukraine war: PM to hold talks with world leade... | Mon, 07 Mar 2022 08:33:29 GMT | https://www.bbc.co.uk/news/uk-60642926 | https://www.bbc.co.uk/news/uk-60642926?at_medi... | Boris Johnson is to meet the Canadian and Dutc... |
| 6 | Ukraine war: UK grants 50 Ukrainian refugee vi... | Mon, 07 Mar 2022 08:09:21 GMT | https://www.bbc.co.uk/news/uk-60640460 | https://www.bbc.co.uk/news/uk-60640460?at_medi... | The home secretary says she is "surging capaci... |
| 7 | TikTok limits services as Netflix pulls out of... | Mon, 07 Mar 2022 00:11:59 GMT | https://www.bbc.co.uk/news/business-60641988 | https://www.bbc.co.uk/news/business-60641988?a... | TikTok suspends live streaming and new content... |
| 8 | Covid: Fourth jab for Scotland's vulnerable, a... | Mon, 07 Mar 2022 07:46:30 GMT | https://www.bbc.co.uk/news/uk-60640975 | https://www.bbc.co.uk/news/uk-60640975?at_medi... | Five things you need to know about the coronav... |
| 9 | Protests across Russia see thousands detained | Sun, 06 Mar 2022 23:23:59 GMT | https://www.bbc.co.uk/news/world-europe-60640204 | https://www.bbc.co.uk/news/world-europe-606402... | People have been held in 53 cities, from St Pe... |


In [6]:
df1.shape

(15159, 5)

**Observation:**
* There are `5 columns` and `15159 rows`.

In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15159 entries, 0 to 15158
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        15159 non-null  object
 1   pubDate      15159 non-null  object
 2   guid         15159 non-null  object
 3   link         15159 non-null  object
 4   description  15159 non-null  object
dtypes: object(5)
memory usage: 592.3+ KB


**Observation:**
* There are no null values in any column.

In [8]:
unique_title=df1.title.unique()
len(unique_title)

14497

In [9]:
unique_description=df1.description.unique()
len(unique_description)

14250

In [10]:
unique_link=df1.link.unique()
len(unique_link)

13838

In [11]:
df1.link.duplicated().sum()

1321

**Observations:**
* The number of unique values differs across the key columns, so some rows are repeated.
* We need to remove these repeated rows.
* The uniqueness of a news item most likely depends on its link, and this column also has more duplicate values than the others.
* So, I'll use the `drop_duplicates()` method on the `link` column.
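
A small illustration of this de-duplication step, as a sketch on toy data (the rows and `example.com` links are invented for the example, not taken from the BBC dataset): `duplicated()` flags rows whose link already appeared earlier, and `drop_duplicates(subset=...)` keeps the first occurrence of each link.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "title": ["A", "A (updated)", "B"],
    "link": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ],
})

# Number of rows whose link already appeared in an earlier row
n_dupes = int(toy["link"].duplicated().sum())

# Keep only the first row for each distinct link
deduped = toy.drop_duplicates(subset=["link"], keep="first")

print(n_dupes)        # 1
print(deduped.shape)  # (2, 2)
```

Because `keep="first"` is the default, the later "A (updated)" row is the one that gets dropped; passing `keep="last"` would retain it instead.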

In [12]:
# Drop duplicate rows based on a subset of columns
df2 = df1.drop_duplicates(subset=['link'])
df2.shape

(13838, 5)

In [13]:
df2=pd.DataFrame(df2)
df2.head()

| | title | pubDate | guid | link | description |
|---|---|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | Mon, 07 Mar 2022 08:01:56 GMT | https://www.bbc.co.uk/news/world-europe-60638042 | https://www.bbc.co.uk/news/world-europe-606380... | The Ukrainian president says the country will ... |
| 1 | War in Ukraine: Taking cover in a town under a... | Sun, 06 Mar 2022 22:49:58 GMT | https://www.bbc.co.uk/news/world-europe-60641873 | https://www.bbc.co.uk/news/world-europe-606418... | Jeremy Bowen was on the frontline in Irpin, as... |
| 2 | Ukraine war 'catastrophic for global food' | Mon, 07 Mar 2022 00:14:42 GMT | https://www.bbc.co.uk/news/business-60623941 | https://www.bbc.co.uk/news/business-60623941?a... | One of the world's biggest fertiliser firms sa... |
| 3 | Manchester Arena bombing: Saffie Roussos's par... | Mon, 07 Mar 2022 00:05:40 GMT | https://www.bbc.co.uk/news/uk-60579079 | https://www.bbc.co.uk/news/uk-60579079?at_medi... | The parents of the Manchester Arena bombing's ... |
| 4 | Ukraine conflict: Oil price soars to highest l... | Mon, 07 Mar 2022 08:15:53 GMT | https://www.bbc.co.uk/news/business-60642786 | https://www.bbc.co.uk/news/business-60642786?a... | Consumers are feeling the impact of higher ene... |


## Step 4: Filter the datasets
* Each of our tasks is text-based, and only two columns in this dataset contain relevant text.
* The first is `title` and the second is `description`.
* So, let's separate them into a new dataset.

In [14]:
data=df2[['title','description']]
data.head()

| | title | description |
|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | The Ukrainian president says the country will ... |
| 1 | War in Ukraine: Taking cover in a town under a... | Jeremy Bowen was on the frontline in Irpin, as... |
| 2 | Ukraine war 'catastrophic for global food' | One of the world's biggest fertiliser firms sa... |
| 3 | Manchester Arena bombing: Saffie Roussos's par... | The parents of the Manchester Arena bombing's ... |
| 4 | Ukraine conflict: Oil price soars to highest l... | Consumers are feeling the impact of higher ene... |


## Step 5: Prepare the input data
* We have two columns: `title` and the corresponding `description`.
* First I'll join the two columns into a single `articles` column (title, then description), and then convert that column into a master list with one entry per article.
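
The idea can be sketched on toy data as follows. The frame `df_toy` and its values are invented for illustration; taking an explicit `.copy()` of the column subset is one common way to add a column without pandas warning about writing to a slice of another DataFrame.

```python
import pandas as pd

# Toy stand-in for the de-duplicated frame (values invented for illustration)
df_toy = pd.DataFrame({
    "title": ["Headline one", "Headline two"],
    "description": ["First summary.", "Second summary."],
    "link": ["https://example.com/1", "https://example.com/2"],
})

# .copy() makes an independent frame, so adding a column to it
# does not touch df_toy
subset = df_toy[["title", "description"]].copy()

# Element-wise string concatenation: "<title> - <description>"
subset["articles"] = subset["title"] + " - " + subset["description"]

# Plain Python list with one string per article
article_list = subset["articles"].tolist()
print(article_list)
```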

In [15]:
# assign() returns a new DataFrame, so no column is set on a slice of df2
data=data.assign(articles=data['title']+' - '+data['description'])
data.head()


| | title | description | articles |
|---|---|---|---|
| 0 | Ukraine: Angry Zelensky vows to punish Russian... | The Ukrainian president says the country will ... | Ukraine: Angry Zelensky vows to punish Russian... |
| 1 | War in Ukraine: Taking cover in a town under a... | Jeremy Bowen was on the frontline in Irpin, as... | War in Ukraine: Taking cover in a town under a... |
| 2 | Ukraine war 'catastrophic for global food' | One of the world's biggest fertiliser firms sa... | Ukraine war 'catastrophic for global food' - O... |
| 3 | Manchester Arena bombing: Saffie Roussos's par... | The parents of the Manchester Arena bombing's ... | Manchester Arena bombing: Saffie Roussos's par... |
| 4 | Ukraine conflict: Oil price soars to highest l... | Consumers are feeling the impact of higher ene... | Ukraine conflict: Oil price soars to highest l... |


In [16]:
articles=data['articles'].tolist()

In [17]:
type(articles)

list

## Step 6: Importing the model through API

### 1. Sentiment Analysis:

In [18]:
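import os

API_URL = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-xlm-roberta-base-sentiment"
# Read the access token from an environment variable instead of hardcoding it
# in a shared notebook (HF_TOKEN is an assumed variable name)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def query(payload):
    """Send the payload to the hosted model and return the parsed JSON response."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": 'Ukraine: Angry Zelensky vows to punish Russians'
})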


In [19]:
output

{'error': 'Rate limit reached. You reached free usage limit (reset hourly). Please subscribe to a plan at https://huggingface.co/pricing to use the API at this rate'}

In [20]:
output=query({'inputs':'Ukraine: Angry Zelensky vows to punish Russian'})
output[0][0]['label']

KeyError: 0

**Note:**
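* On success, the output is a 2-dimensional array: a list containing a list of label/score dictionaries.
* Sentiment scores are in descending order.
* When the rate limit is reached, as above, the API returns a dictionary with an `error` key instead, which is why indexing it with `[0]` raises `KeyError: 0`.
* Let's make the output simpler for the viewer.
* I'm also going to define a function to use this model on our dataset.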
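
A minimal sketch of reading such a response defensively. The helper name `top_label` and the two sample responses are hypothetical, shaped like the outputs described in the note; the helper picks the highest score explicitly rather than relying on the ordering.

```python
def top_label(output):
    """Return the highest-scoring label, or None if the response is not a result."""
    # Error responses arrive as a dict such as {'error': '...'}
    if not isinstance(output, list) or not output:
        return None
    candidates = output[0]
    if not isinstance(candidates, list) or not candidates:
        return None
    # Pick the candidate with the largest score
    best = max(candidates, key=lambda item: item["score"])
    return best["label"]


# Hypothetical responses, used only to exercise the helper
sample_ok = [[
    {"label": "negative", "score": 0.81},
    {"label": "neutral", "score": 0.15},
    {"label": "positive", "score": 0.04},
]]
sample_error = {"error": "Rate limit reached."}

print(top_label(sample_ok))     # negative
print(top_label(sample_error))  # None
```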

In [None]:
def texts_sentiment(texts):
    results = []
    for text in texts:
        output = query({"inputs": text})
        sentiment = output[0][0]['label']
        results.append(sentiment)
    
    sentiment_df = pd.DataFrame({'article': texts, 'sentiment': results})
    return sentiment_df


In [None]:
# Safer version: replaces the definition above and tolerates failed requests
def texts_sentiment(texts):
    results = []
    for text in texts:
        output = query({"inputs": text})
        # A successful response is a nested list; errors come back as a dict
        if isinstance(output, list) and output and output[0]:
            sentiment = output[0][0]['label']
            results.append(sentiment)
        else:
            results.append(None)  # Add None when the request failed or returned nothing

    sentiment_df = pd.DataFrame({'articles': texts, 'sentiment': results})
    return sentiment_df
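
Since one request is sent per article, a failed call is likely on a free tier. The following is a sketch, not part of the original notebook, of wrapping any query function with retries and exponential backoff. The names `query_with_retry` and `fake_query` are hypothetical, and the stand-in function replaces the real API so the example runs offline. Short delays only help with transient errors; the hourly limit reported earlier would need a much longer wait or a smaller sample of articles.

```python
import time


def query_with_retry(query_fn, payload, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call query_fn(payload); retry with exponential backoff while it returns an error dict."""
    output = None
    for attempt in range(max_attempts):
        output = query_fn(payload)
        # A dict containing 'error' signals a failed call (e.g. rate limit)
        if not (isinstance(output, dict) and "error" in output):
            return output
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return output  # still an error after all attempts


# Stand-in for query(): fails on the first call, succeeds afterwards
calls = {"n": 0}


def fake_query(payload):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"error": "Rate limit reached."}
    return [[{"label": "neutral", "score": 0.9}]]


# sleep is replaced by a no-op so the demonstration finishes instantly
result = query_with_retry(fake_query, {"inputs": "example"}, sleep=lambda seconds: None)
print(result)
```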


In [None]:
sentiments=texts_sentiment(articles)
sentiments

### 2. Text Generation:

In [None]:
import os

API_URL = "https://api-inference.huggingface.co/models/gpt2-medium"
# Read the access token from an environment variable instead of hardcoding it
# in a shared notebook (HF_TOKEN is an assumed variable name)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def query(payload):
    """Send the payload to the hosted model and return the parsed JSON response."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Ukraine: Angry Zelensky vows to punish Russian atrocities - The Ukrainian president says the country will not forgive or forget those who murder its civilians. ",
})
output

In [None]:
output

**Note:**
* The output is JSON text.
* Now I'm going to build a function that takes `articles` as input, generates more text on each article's topic, and saves the results in a DataFrame that can be merged into the original `data`.
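
A sketch of pulling the generated string out of each response before building the DataFrame. It assumes a successful text-generation response has the shape `[{'generated_text': '...'}]`; the helper `extract_generated_text` and the sample responses are hypothetical and stand in for real API calls.

```python
import pandas as pd


def extract_generated_text(output):
    """Return the generated string from a text-generation response, else None."""
    # Assumed success shape: [{'generated_text': '...'}]
    if isinstance(output, list) and output and isinstance(output[0], dict):
        return output[0].get("generated_text")
    return None  # e.g. an error dict such as {'error': '...'}


# Hypothetical responses, used only to exercise the helper
fake_outputs = [
    [{"generated_text": "Prompt one and a continuation."}],
    {"error": "Rate limit reached."},
]
prompts = ["Prompt one", "Prompt two"]

generated_df = pd.DataFrame({
    "articles": prompts,
    "generated articles": [extract_generated_text(o) for o in fake_outputs],
})
print(generated_df)
```

Storing only the string, rather than the raw JSON, keeps the column easy to read and to merge back into `data`.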

In [None]:
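def texts_generator(articles):
    """Generate a continuation for each article and collect the results in a DataFrame."""
    results = []
    for article in articles:
        output = query({"inputs": article})
        generated_texts = output  # raw JSON response for this article
        results.append(generated_texts)

    generated_texts_df = pd.DataFrame({'articles': articles, 'generated articles': results})

    return generated_texts_df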


In [None]:
generated_texts = texts_generator(articles)
generated_texts

In [None]:
articles[0]

In [None]:
import os

API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-j-6B"
# Read the access token from an environment variable instead of hardcoding it
# in a shared notebook (HF_TOKEN is an assumed variable name)
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def query(payload):
    """Send the payload to the hosted model and return the parsed JSON response."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Ukraine: Angry Zelensky vows to punish Russian atrocities - The Ukrainian president says the country will not forgive or forget those who murder its civilians.",
})
output

In [None]:
jovian.commit(project='Kaholas_Assignment')
