# Setting up Kaggle API

Copy `kaggle.json` from the local computer into the notebook's working directory.

In [None]:
!ls -lha kaggle.json

-rw-r--r-- 1 root root 63 Sep 20 04:03 kaggle.json


In [None]:
!pip install -q kaggle

In [None]:
!mkdir -p ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/

In [None]:
!chmod 600 ~/.kaggle/kaggle.json
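
The `mkdir`, `cp`, and `chmod` cells above can also be collapsed into one re-runnable Python step. This is a minimal sketch using only the standard library; `install_kaggle_credentials` and its `home` parameter are hypothetical names introduced here for illustration, not part of the Kaggle tooling.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(source="kaggle.json", home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".kaggle"
    # exist_ok=True makes the step safe to re-run, unlike a bare `mkdir`
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Same effect as `chmod 600`: read/write for the owner only
    os.chmod(target, 0o600)
    return target
```

Calling `install_kaggle_credentials()` with no arguments would copy `./kaggle.json` to `~/.kaggle/kaggle.json`, matching the shell cells.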

In [None]:
!kaggle datasets download -d pknayak/news-api-scrapped-dataset

Downloading news-api-scrapped-dataset.zip to /content
  0% 0.00/669k [00:00<?, ?B/s]
100% 669k/669k [00:00<00:00, 48.6MB/s]


In [None]:
!mkdir -p ./data

In [None]:
!unzip ./news-api-scrapped-dataset.zip -d ./data

Archive:  ./news-api-scrapped-dataset.zip
  inflating: ./data/combined_news_response.csv  


# Loading the dataset

In [None]:
import pandas as pd

In [None]:
pd.set_option('display.float_format', '{:.4f}'.format)

In [None]:
data_dir = "./data/"

In [None]:
df = pd.read_csv(data_dir+'combined_news_response.csv', sep = ',')
df.head(5)

,Unnamed: 0,title,keywords,creator,description,content,pubDate,country,category,language,file_keyword
0,0,IRS announces new plan to hire more agents for...,,['Katherine Donlevy'],The federal agency said it plans to aggressive...,The IRS announced it will start targeting thou...,2023-09-10 02:15:26,['united states of america'],['top'],english,Earnings Reports
1,1,American teen Gauff wins US Open for first major,,,"Coco Gauff, 19, is the youngest US Open winner...",NEW YORK -- Coco Gauff won her first major tit...,2023-09-10 02:10:45,"['india', 'singapore', 'canada', 'australia', ...",['sports'],english,Earnings Reports
2,2,How to watch UFC 293: Adesanya vs. Strickland ...,,['Quentyn Kennemer'],,"When you buy through our links, Insider may ea...",2023-09-10 00:00:01,['united states of america'],['top'],english,Earnings Reports
3,3,Report: Which Utah university gives the best b...,,['Marjorie Cortez'],,A new Utah Foundation report affirms the more ...,2023-09-10 00:00:00,['united states of america'],['sports'],english,Earnings Reports
4,4,American tennis star Coco Gauff wins Grand Sla...,"['In The Know', 'News', 'tennis', 'US Open']",['Julia Shapero'],American tennis star Coco Gauff won the Grand ...,American tennis star Coco Gauff won the Grand ...,2023-09-09 23:12:08,['united states of america'],['top'],english,Earnings Reports


In [None]:
df.columns

Index(['Unnamed: 0', 'title', 'keywords', 'creator', 'description', 'content',
       'pubDate', 'country', 'category', 'language', 'file_keyword'],
      dtype='object')

In [None]:
df.drop('Unnamed: 0',axis=1,inplace=True)
df.columns

Index(['title', 'keywords', 'creator', 'description', 'content', 'pubDate',
       'country', 'category', 'language', 'file_keyword'],
      dtype='object')

In [None]:
df['description'].isnull().sum(),df['content'].isnull().sum()

(19, 0)

# Labelling the dataframe using various techniques

## 1. Using the VADER Sentiment Analysis Tool

In [None]:
labelled_df = df.copy()

### Sample code to check how VADER works

In [None]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.0/126.0 kB 2.2 MB/s eta 0:00:00
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

In [None]:
# Sample text
text = "I love this product! It's fantastic."

In [None]:
# Analyze sentiment
sentiment_scores = analyzer.polarity_scores(text)


In [None]:
# Interpret the sentiment scores
compound_score = sentiment_scores['compound']

In [None]:
def tranform_score(compound_score):
    """Map a VADER compound score to a sentiment label."""
    # Thresholds conventionally used with VADER:
    # >= 0.05 is positive, <= -0.05 is negative, anything in between is neutral
    if compound_score >= 0.05:
        return "Positive"
    elif compound_score <= -0.05:
        return "Negative"
    return "Neutral"


In [None]:
tranform_score(compound_score)

'Positive'

### VADER on the original DataFrame

In [None]:
labelled_df.iloc[0]['content'], type(labelled_df.iloc[0]['content'])

('The IRS announced it will start targeting thousands of uber-wealthy tax cheats and make them poney up hundreds of millions of dollars in the coming months. The federal agency said it plans to aggressively hunt down 1,600 millionaires and 75 large business partnerships that routinely “cut corners” on their taxes. “If you pay your taxes on time it should be particularly frustrating when you see that wealthy filers are not,” IRS Commissioner Daniel Werfel said. Each of the millionaire tax dodgers that the IRS set its sights on owes a minimum of at least $250,000 each in back taxes, while the partnerships have assets of roughly $10 billion on average. The IRS is planning to hire an undisclosed number of revenue agents to focus on the high-end collection cases so that the new initiative can start rolling in October this year. The crackdown represents the next phase of an offensive pursuit that the Biden Administration launched earlier this summer — in July, the IRS said it collected $38 m

In [None]:
# Content of the first article
text = labelled_df.iloc[0]['content']

In [None]:
# Analyze sentiment
sentiment_scores = analyzer.polarity_scores(text)

In [None]:
# Interpret the sentiment scores
compound_score = sentiment_scores['compound']

In [None]:
tranform_score(compound_score)

'Positive'

In [None]:
labelled_df['content'][:50]

In [None]:
labelled_df.iloc[2]['content']

"When you buy through our links, Insider may earn an affiliate commission. Learn more Everyone should be excited for UFC 293's main card. It features middleweight champion Israel Adesanya defending his newly regained title against Sean Strickland, the first defense of this reign. We'll show you how to watch Adesanya vs. Strickland live streams live, no matter where you are. In addition to the middleweight championship fight, UFC is billing a heavyweight fight between Tai Tuivasa and Alexander Volkov as co-main event. There are many other fights leading up to these main events, so we'll show you how to watch UFC 293 live streams in the US, UK, and more. The main card pay-per-view livestream is exclusive to ESPN Plus members in the US, but if you're outside the country, you can tap into a few international viewing options. We'll also guide you step-by-step on how to use a VPN to watch UFC 293, no matter where you are. You might even find some of the international viewing options we've fo

In [None]:
labelled_df.iloc[14]['content']

"Considering how markets are extremely sensitive to US data at the moment, it is best to be prepared for more potential landmines to come before we get to the next FOMC meeting later this month. In the next week, there are going to be some major economic releases and those will feed further into the Fed pricing outlook as noted . The big one to watch will be the CPI report on Wednesday but don't discount the impact of the other data releases as well, especially if there are going to be surprises along the way."

In [None]:
labelled_df.iloc[38]['content']

'With concerns of a recession — or stock market crash — in full swing, next week’s Consumer Price Index ( CPI ) report represents perhaps the single most important economic data release ahead of the Federal Reserve’s upcoming policy decision. Indeed, the CPI report may offer insight into whether the Fed’s war on inflation is coming to a soft or emergency landing. After a surprising increase in prices in July, economists are crossing their fingers that price growth cooled in August. The headline CPI rose 3.2% on an annual basis in July, the first acceleration in more than a year. July’s hotter-than-expected inflation has stoked fears that the Fed may push the rate hike button again soon. As such, next week’s CPI report may force the Fed’s hand at its next policy meeting, Sept. 19-20. With just days between the August CPI and the FOMC meeting, the inflation data will likely prove Fed members’ final consideration before coming to a decision. And what a decision it will be. Even with fears

In [None]:
def predict_sentiment(news_article):
    # Sample text
    text = news_article

    # Analyze sentiment
    sentiment_scores = analyzer.polarity_scores(text)

    # Interpret the sentiment scores
    compound_score = sentiment_scores['compound']

    # based on score denote the sentiment
    sentiment = tranform_score(compound_score)

    return sentiment

In [None]:
predict_sentiment(labelled_df.iloc[38]['content'])

'Negative'

In [None]:
for i in [2,14,38]:
  news_article = labelled_df.iloc[i]['content']
  print(f' Sentiment: "{predict_sentiment(news_article)}" for {news_article}  ')

 Sentiment: "Positive" for When you buy through our links, Insider may earn an affiliate commission. Learn more Everyone should be excited for UFC 293's main card. It features middleweight champion Israel Adesanya defending his newly regained title against Sean Strickland, the first defense of this reign. We'll show you how to watch Adesanya vs. Strickland live streams live, no matter where you are. In addition to the middleweight championship fight, UFC is billing a heavyweight fight between Tai Tuivasa and Alexander Volkov as co-main event. There are many other fights leading up to these main events, so we'll show you how to watch UFC 293 live streams in the US, UK, and more. The main card pay-per-view livestream is exclusive to ESPN Plus members in the US, but if you're outside the country, you can tap into a few international viewing options. We'll also guide you step-by-step on how to use a VPN to watch UFC 293, no matter where you are. You might even find some of the internationa

ChatGPT response for `labelled_df.iloc[14]['content']`:
>The provided phrase seems to have a somewhat neutral tone with a hint of caution. It discusses the sensitivity of the markets to US data and the need to be prepared for potential challenges before the next FOMC meeting. It also mentions upcoming economic releases and the importance of the CPI report. Overall, it's a balanced statement with neither strongly positive nor negative sentiment.
So, it can be categorized as "Neutral."

The second article (index 14) should have been labelled negative, or at least neutral, as ChatGPT suggests. Any method that labels it neutral is good enough for labelling the data.
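
The mismatch noted above is easier to track when every labeller is run on the same articles and the rows where they differ are listed. A minimal sketch, assuming each labeller is a function from text to `"Positive"`, `"Negative"`, or `"Neutral"`; `find_disagreements` and the two toy labellers are hypothetical names introduced here for illustration.

```python
def find_disagreements(texts, labelers):
    """Return (index, {labeler_name: label}) for texts the labelers disagree on.

    `labelers` maps a name to a function that takes a string and returns
    "Positive", "Negative" or "Neutral".
    """
    disagreements = []
    for index, text in enumerate(texts):
        labels = {name: labeler(text) for name, labeler in labelers.items()}
        # More than one distinct label means the labelers do not agree
        if len(set(labels.values())) > 1:
            disagreements.append((index, labels))
    return disagreements


# Toy stand-ins so the sketch runs on its own (not real sentiment models)
def toy_a(text):
    return "Negative" if "crash" in text.lower() else "Neutral"


def toy_b(text):
    lowered = text.lower()
    if "crash" in lowered:
        return "Negative"
    return "Positive" if "wins" in lowered else "Neutral"


sample_texts = [
    "Stock market crash feared",
    "Gauff wins US Open",
    "CPI report due Wednesday",
]
print(find_disagreements(sample_texts, {"toy_a": toy_a, "toy_b": toy_b}))
```

In this notebook the dictionary values would be the real labelling functions (for example `predict_sentiment`), and `texts` would be `labelled_df['content']`.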


## 2. Using TextBlob

In [None]:
!pip install textblob



In [None]:
from textblob import TextBlob

In [None]:
def predict_sentiment_textblob(news_article):

    # Sample text
    text = news_article

    # Analyze sentiment
    analysis = TextBlob(text)

    # Interpret the sentiment
    polarity = analysis.sentiment.polarity

    if polarity > 0:
        sentiment = "Positive"
    elif polarity < 0:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"

    return sentiment

In [None]:
predict_sentiment_textblob(labelled_df.iloc[14]['content'])

'Positive'

## 3. Using VADER through NLTK (lexicon-based)

In [None]:
# Example using NLTK and the VADER lexicon-based model
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
# Download the VADER lexicon (if not already downloaded)
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
def predict_sentiment_nltk_vader(news_article):
    # Initialize the VADER sentiment analyzer
    analyzer = SentimentIntensityAnalyzer()

    # Sample text
    text = news_article

    # Analyze sentiment
    sentiment_scores = analyzer.polarity_scores(text)

    # Interpret the sentiment scores
    compound_score = sentiment_scores['compound']

    if compound_score >= 0.05:
        sentiment = "Positive"
    elif compound_score <= -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"

    return sentiment

In [None]:
predict_sentiment_nltk_vader(labelled_df.iloc[0]['content'])

'Positive'

In [None]:
predict_sentiment_nltk_vader(labelled_df.iloc[2]['content']), \
predict_sentiment_nltk_vader(labelled_df.iloc[14]['content']), \
predict_sentiment_nltk_vader(labelled_df.iloc[38]['content'])

('Positive', 'Positive', 'Negative')

## Using BERT-family models to create labelled data

* Refer to [link](https://www.notion.so/mekongdelta/MACHINE-LEARNING-COMPETITION-by-TradingGEN-4412d3cd718e41a1b4057d73b7e711d2?pvs=4#07eab6ab22024bee9dc0073f33c6ea33)
  * This is the best way to label the data because these models take the context and the meaning behind the text into account, while the other methods focus mainly on keywords. So focus on labelling the data with BERT, GPT, or LLaMA models.

* https://www.notion.so/mekongdelta/Sentiment-Analysis-on-News-Headlines-Classic-Supervised-Learning-vs-Deep-Learning-Approach-d67059f751c54ef1a2877017e1ac944d
  * Here the Snorkel technique can be used for labelling the unlabelled dataset.

* https://www.notion.so/mekongdelta/Sentiment-Analysis-on-Reddit-News-Headlines-with-Python-s-Natural-Language-Toolkit-NLTK-3c064e6937a644009c97221703748391
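
As a rough illustration of the first bullet, the sketch below turns the output of a BERT-family classifier into the same three labels used earlier. It assumes the classifier returns one `{"label": ..., "score": ...}` dict per text, which is the shape produced by Hugging Face text-classification pipelines. `label_with_classifier`, `fake_classify`, and the 0.75 confidence cut-off are hypothetical choices made for this example, not tuned values.

```python
def label_with_classifier(texts, classify, min_confidence=0.75):
    """Map classifier output to Positive / Negative / Neutral.

    `classify` takes a list of strings and returns one
    {"label": ..., "score": ...} dict per string. Predictions below
    `min_confidence`, or with an unexpected label, fall back to "Neutral"
    (an assumption of this sketch).
    """
    labels = []
    for prediction in classify(list(texts)):
        name = prediction["label"].capitalize()  # "POSITIVE" -> "Positive"
        if prediction["score"] < min_confidence or name not in ("Positive", "Negative"):
            name = "Neutral"
        labels.append(name)
    return labels


# Stand-in classifier so the sketch runs without downloading a model
def fake_classify(texts):
    return [
        {"label": "NEGATIVE", "score": 0.97} if "crash" in text
        else {"label": "POSITIVE", "score": 0.55}
        for text in texts
    ]


print(label_with_classifier(["market crash ahead", "mixed data this week"], fake_classify))
```

With the `transformers` library installed, `classify` could be built from `clf = pipeline("sentiment-analysis")` as `lambda texts: clf(texts, truncation=True)`. The checkpoint and its label names are outside this notebook, so they should be checked before relying on the mapping.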

Inferences:

Columns selected:
  * The `content` column has no missing values, so use it for the analysis.

# Gradio


For each step in the PDF, create a method for the Gradio app.

In [None]:
!pip install gradio

In [None]:
import gradio as gr

In [None]:
demo = gr.Blocks()

# import numpy as np

# target_dtype = np.int16
# max_range = np.iinfo(target_dtype).max


# def speech_to_speech_translation(audio):
#     translated_text = translate(audio)
#     synthesised_speech = synthesize(translated_text)
#     synthesised_speech = (synthesised_speech.numpy() * max_range).astype(np.int16)
#     return 16000, synthesised_speech

# mic_translate = gr.Interface(
#     fn=speech_to_speech_translation,
#     inputs=gr.Audio(source="microphone", type="filepath"),
#     outputs=gr.Audio(label="Generated Speech", type="numpy"),
# )
# file_translate = gr.Interface(
#     fn=speech_to_speech_translation,
#     inputs=gr.Audio(source="upload", type="filepath"),
#     outputs=gr.Audio(label="Generated Speech", type="numpy"),
# )
#
# Either a single interface, or one each for sentiment analysis and data visualization
# # Interface 1
# file_translate = gr.Interface(
#     fn=sentiment_analysis_engine, # name of the function that predicts the sentiment and score
#     inputs=gr.Audio(source="upload", type="filepath"), # placeholder: take a phrase as input, or call the API directly to get the input
#     outputs=gr.Audio(label="Generated Speech", type="numpy"), # placeholder: return the sentiment (negative, positive, or neutral) and a score
# )

# Interface 2 (tentative)
# file_translate = gr.Interface(
#     fn=sentiments_data_viz, # name of the function that creates a data visualization of the sentiment
#     inputs=gr.Audio(source="upload", type="filepath"), # placeholder: for a specific date, check the sentiment distribution and word clouds, per category and overall
#     outputs=gr.Audio(label="Generated Speech", type="numpy"), # placeholder: check how to render a data visualization in Gradio
# )
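
The first planned interface (a phrase in, a sentiment label and a score out) could look roughly like the sketch below. It assumes a VADER-style scorer, that is, any function returning a dict with a `compound` key, and reuses the 0.05 thresholds from earlier in the notebook. `sentiment_analysis_engine` and `build_sentiment_interface` are hypothetical names for this example, and Gradio is imported inside the builder so the engine can be tried without it.

```python
def sentiment_analysis_engine(text, polarity_scores):
    """Return (label, compound score) for a phrase.

    `polarity_scores` is any function returning a dict with a 'compound'
    key, such as the `polarity_scores` method of a VADER analyzer.
    """
    compound = polarity_scores(text)["compound"]
    if compound >= 0.05:
        label = "Positive"
    elif compound <= -0.05:
        label = "Negative"
    else:
        label = "Neutral"
    return label, compound


def build_sentiment_interface(polarity_scores):
    """Wrap the engine in a gr.Interface with one text input and two outputs."""
    import gradio as gr  # imported lazily so the engine works without Gradio

    return gr.Interface(
        fn=lambda text: sentiment_analysis_engine(text, polarity_scores),
        inputs=gr.Textbox(label="News text", lines=5),
        outputs=[gr.Textbox(label="Sentiment"), gr.Number(label="Compound score")],
    )
```

With the analyzer created earlier, `build_sentiment_interface(analyzer.polarity_scores).launch()` would serve it as a standalone app.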



Selected data visualizations:
* Bar Charts and Histograms: These are simple and effective for showing sentiment distribution. Each sentiment category (positive, negative, neutral) is represented as a bar, and the height of the bar corresponds to the frequency or percentage of each sentiment in the dataset.

* Pie Charts: Pie charts can show the proportion of each sentiment class in the dataset. Each sentiment category is represented as a slice of the pie, with the size of the slice corresponding to its relative frequency.

* Word Clouds: Word clouds visually display the most frequent words in a text corpus. You can create separate word clouds for positive, negative, and neutral sentiments, with word size indicating word frequency.

* Heatmaps: Heatmaps can show sentiment distribution across different categories or topics. Rows and columns represent categories or topics, and each cell's color intensity indicates the sentiment strength or frequency.
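
The bar chart, pie chart, and heatmap above all start from simple counts. The sketch below computes them with the standard library only; `sentiment_distribution` and `sentiment_by_category` are hypothetical helper names, and the plotting itself is left to whichever charting library is chosen.

```python
from collections import Counter, defaultdict


def sentiment_distribution(labels):
    """Counts and percentages per sentiment, ready for a bar or pie chart."""
    counts = Counter(labels)
    total = sum(counts.values())
    distribution = {}
    for sentiment in ("Positive", "Negative", "Neutral"):
        count = counts[sentiment]
        percent = 100 * count / total if total else 0.0
        distribution[sentiment] = {"count": count, "percent": percent}
    return distribution


def sentiment_by_category(categories, labels):
    """Nested counts {category: {sentiment: n}}, the grid behind a heatmap."""
    grid = defaultdict(Counter)
    for category, label in zip(categories, labels):
        grid[category][label] += 1
    return {category: dict(counter) for category, counter in grid.items()}


example_labels = ["Positive", "Negative", "Positive", "Neutral"]
example_categories = ["top", "sports", "top", "top"]
print(sentiment_distribution(example_labels))
print(sentiment_by_category(example_categories, example_labels))
```

In this dataset the `category` column holds strings such as `['top']`, so it may need parsing before being used as the heatmap's row key.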

In [None]:
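# mic_translate / file_translate are only sketched in the comments above, so
# serve the VADER-based predict_sentiment defined earlier instead.
sentiment_tab = gr.Interface(fn=predict_sentiment, inputs="text", outputs="text")

with demo:
    gr.TabbedInterface([sentiment_tab], ["Sentiment Analysis"])

demo.launch(debug=True)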

# References:

1. [HF Audio course hands on](https://github.com/MekongDelta-mind/hf_audio_course/blob/main/hands_on_unit_7_hf_audio.ipynb)

# GenAI discussions

<details>
  <summary>What are the different kinds of data visualization used for representing sentiment outputs from sentiment analysis models?</summary>
  Visualizing sentiment analysis results can provide valuable insights into the sentiment distribution of a dataset. Several data visualization techniques are commonly used to represent sentiment outputs:

1. **Bar Charts and Histograms:** These are simple and effective for showing sentiment distribution. Each sentiment category (positive, negative, neutral) is represented as a bar, and the height of the bar corresponds to the frequency or percentage of each sentiment in the dataset.

2. **Pie Charts:** Pie charts can show the proportion of each sentiment class in the dataset. Each sentiment category is represented as a slice of the pie, with the size of the slice corresponding to its relative frequency.

3. **Line Charts:** Line charts can be used to visualize changes in sentiment over time, especially in time-series data. For example, you can track how sentiment towards a product or topic evolves over weeks or months.

4. **Heatmaps:** Heatmaps can show sentiment distribution across different categories or topics. Rows and columns represent categories or topics, and each cell's color intensity indicates the sentiment strength or frequency.

5. **Scatter Plots:** Scatter plots can be used when sentiment scores are continuous or have a range. Each data point represents a document or text, and the x-axis can represent one sentiment score (e.g., positive), while the y-axis represents another (e.g., negative).

6. **Word Clouds:** Word clouds visually display the most frequent words in a text corpus. You can create separate word clouds for positive, negative, and neutral sentiments, with word size indicating word frequency.

7. **Stacked Bar Charts:** Stacked bar charts show the composition of sentiment within different categories or topics. Each bar is divided into segments, and each segment represents a sentiment category.

8. **Sentiment Timeline:** This type of visualization plots sentiment scores over time, showing how sentiment fluctuates. It can be useful for tracking public sentiment on social media or news articles.

9. **Box Plots and Violin Plots:** These plots show the distribution of sentiment scores, including measures such as median, quartiles, and outliers. They are particularly useful for comparing sentiment across different groups or categories.

10. **Chord Diagrams:** Chord diagrams represent relationships between sentiment categories or topics. The thickness of the chords connecting categories indicates the strength of the relationship.

11. **Network Graphs:** Network graphs can be used to visualize co-occurrence relationships between words or entities within different sentiment categories. Nodes represent words or entities, and edges represent co-occurrence relationships.

12. **Geospatial Maps:** If sentiment analysis is applied to data with geographical information, you can visualize sentiment distribution on maps. Different regions or countries can be colored based on the predominant sentiment.

The choice of visualization depends on the nature of your data, the research questions you want to answer, and the audience you want to communicate with. Often, a combination of these visualization types can provide a comprehensive view of sentiment analysis results.
</details>