# General Test and Review Advice
* The multiple choice portion of the exam will account for 40% of your grade on the Final, while the programming portion will account for 60%.
* The multiple choice portion is entirely closed book (taken through Respondus Browser) and will primarily assess your knowledge of major Python syntax, NLP, and Machine Learning libraries. Sessions 22-27, 30-38 (and this review) are most pertinent for this portion.
* The programming and analysis portion will consist of three major aspects: NLP processing of text, a simple supervised learning task, and some visualization of results.

# Topic 1: Object-Oriented Programming (Sessions 22-23)
* Object-Oriented Programming is a paradigm in which the focus of software engineering is to compartmentalize program responsibilities and behavior to individual "entities" (generally referred to as classes/objects).
* In general, the **class** defines a blueprint from which any number of **objects** (or instances of the particular class) may be constructed.  
    * Consider the conceptual class _car_ which provides a single blueprint (general guidelines and qualities) for actual instances of the purportedly 1.4 billion _car_ objects in the world, each of which has varying makes, models, and other qualities. 
* Object variables (**attributes**) represent the state of a particular object (its characteristics, etc.).
* Object functions (**methods**) represent specific behaviors the object can perform.

In [2]:
# account.py
"""Account class definition."""
from decimal import Decimal

class Account:
    """Account class for maintaining a bank account balance."""
    
    def __init__(self, name, balance):
        """Initialize an Account object."""

        # if balance is less than 0.00, raise an exception
        if balance < Decimal('0.00'):
            raise ValueError('Initial balance must be >= to 0.00.')

        self.name = name
        self.balance = balance

    def deposit(self, amount):
        """Deposit money to the account."""

        # if amount is less than 0.00, raise an exception
        if amount < Decimal('0.00'):
            raise ValueError('amount must be positive.')

        self.balance += amount

In [3]:
accounts =[]
accounts.append(Account('John Green', Decimal('50.00')))
accounts.append(Account('Marsha Braeburn', Decimal('100000.00')))
accounts.append(Account('Pat Arkthen', Decimal('753.00')))


In [4]:
print(f'{"Account Name":>15}{"Balance":>15}')
for acct in accounts:
    print(f'{acct.name:>15}{acct.balance:>15}')
    

   Account Name        Balance
     John Green          50.00
Marsha Braeburn      100000.00
    Pat Arkthen         753.00


In [5]:
accounts[0].deposit(Decimal(1000))
print(accounts[0].balance)

1050.00


In [6]:
accounts[0].deposit(Decimal(-1000))


ValueError: amount must be positive.

## Controlling Access to Attributes (Encapsulation)
* Our prior example used the attributes `name` and `balance` only to _get_ their values.
* Client code could also have used those attributes to _modify_ the values directly.
* A class’s **client code** is any code that uses objects of the class
* Most object-oriented programming languages enable you to **encapsulate** (or _hide_) an object’s data from the client code
    * AKA making data _secure_ or _private_
* **Python does _not_ have private data**
* Instead, it uses _naming conventions_ to design classes that encourage correct use
* By convention, Python programmers know that any attribute name beginning with an underscore (`_`) is for a class’s _internal use only_
* The same can be applied to _support methods_ -- these methods are implemented solely with the intention of supporting other (usually "public") methods
* Attributes and methods whose identifiers do _not_ begin with an underscore (`_`) are considered _publicly accessible_ for use in client code
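
The naming convention above can be seen in a minimal sketch. The `Wallet` class and its method names are hypothetical, invented only for illustration; the point is that the leading underscore signals intent but Python does not enforce it.

```python
class Wallet:
    """Hypothetical class whose balance is 'private' by convention only."""

    def __init__(self, balance):
        self._balance = balance  # leading underscore: internal use only

    def deposit(self, amount):
        """Public method that validates before touching _balance."""
        if amount <= 0:
            raise ValueError('amount must be positive.')
        self._balance += amount

    def get_balance(self):
        """Public accessor intended for client code."""
        return self._balance


wallet = Wallet(50)
wallet.deposit(25)
print(wallet.get_balance())  # 75

# Nothing stops client code from bypassing the validation;
# the underscore only tells programmers that they should not do this.
wallet._balance = -1000
print(wallet.get_balance())  # -1000
```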

# Properties for Data Access
* **Properties** can control the manner in which an object’s data is _accessed_ and _modified_
    * Again, here we are **assuming programmers follow conventions** -- this still does not guarantee "private" data-types.
* **Properties** provide a sort of _pseudo-variable_ that generally accesses an underlying "real" variable, and permits additional functionality (most commonly, **data validation**)
* The most common properties are **getters** and **setters**
    * A _getter_ method returns a data attribute's value, and is prefaced with the line `@property`
    * A _setter_ method assigns a new value to a data attribute, usually after validating it.  It is prefaced with the line `@<property_name>.setter`, where _property_name_ refers to the property being set.


In [7]:

class Fraction:
    """Class Fraction with properties."""

    def __init__(self, numerator=0, denominator=1):
        """Initialize each attribute."""
        self.numerator = numerator  # Note: this will actually use a property setter.
        self.denominator = denominator  # As will this one.
    
    @property
    def numerator(self):
        """Return the numerator."""
        return self._numerator

    @numerator.setter
    def numerator(self, numerator):
        """Set the numerator."""
        if not (isinstance(numerator, int)): #Data validation (numerator must be a whole number ) 
            raise ValueError(f'The numerator must be a whole number')
        self._numerator = numerator
    
    @property
    def denominator(self):
        """Return the denominator."""
        return self._denominator

    @denominator.setter
    def denominator(self, denominator):
        """Set the denominator."""
        if not (isinstance(denominator, int)): #Data validation (denominator must be a whole number ) 
            raise ValueError(f'The denominator must be a whole number.')
        if denominator<1: #Data validation (denominator must be 1+) 
            raise ValueError(f'The denominator must be greater than 0.')

        self._denominator = denominator

In [8]:
myfrac1=Fraction(1,5)
print(f'({myfrac1.numerator} / {myfrac1.denominator})')

(1 / 5)


In [9]:
myfrac2=Fraction(1.1,5)
print(f'({myfrac2.numerator} / {myfrac2.denominator})')  # never reached: the constructor raises first

ValueError: The numerator must be a whole number

In [10]:
myfrac2=Fraction(2,-1)
print(f'({myfrac2.numerator} / {myfrac2.denominator})')  # never reached: the constructor raises first

ValueError: The denominator must be greater than 0.

## More on Classes and Inheritance
* For most _common_ tasks and applications, you will likely find that an open-source class or library has already been written.
    * However, you may find that for a specialized task it is useful to write your own class(es) from scratch _or_ derive them from an existing class.
* When creating a new class, you can **inherit** the attributes (variables) and methods (the class version of functions) of a previously defined **base class** (also called a **superclass**)
* The new class is called a **derived class** (or **subclass**)
* You then customize the derived class to meet the specific needs of your application.
    * This can provide a nice compromise between constructing a custom class completely from scratch vs simply using a pre-written library as is, which may not be a perfect fit for your needs.
* Remember that we created many inheriting classes during the Twitter API sessions.

```python
# salariedcommissionemployee.py
"""SalariedCommissionEmployee derived from CommissionEmployee."""
from commissionemployee import CommissionEmployee
from decimal import Decimal

class SalariedCommissionEmployee(CommissionEmployee):
    """An employee who gets paid a salary plus 
    commission based on gross sales."""

    def __init__(self, first_name, last_name, ssn, 
                 gross_sales, commission_rate, base_salary):
        """Initialize SalariedCommissionEmployee's attributes."""
        super().__init__(first_name, last_name, ssn, 
                         gross_sales, commission_rate)
        self.base_salary = base_salary  # validate via property

    @property
    def base_salary(self):
        return self._base_salary

    @base_salary.setter
    def base_salary(self, salary):
        """Set base salary or raise ValueError if invalid."""
        if salary < Decimal('0.00'):
            raise ValueError('Base salary must be >= to 0')
        
        self._base_salary = salary

    def earnings(self): #distinct from CommissionEmployee
        """Calculate earnings."""   
        return super().earnings() + self.base_salary

    def __repr__(self): #distinct from CommissionEmployee
        """Return string representation for repr()."""
        return ('Salaried' + super().__repr__() +      
            f'\nbase salary: {self.base_salary:.2f}')

```
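
Because the `CommissionEmployee` base class is not reproduced in this review, the block above cannot be run on its own. The following is a smaller, self-contained sketch of the same pattern (calling `super().__init__` and overriding a method); the `Employee` and `Manager` classes are hypothetical and are not the classes used in the sessions.

```python
class Employee:
    """Hypothetical base class."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def earnings(self):
        """Base-class behavior: earnings are just the salary."""
        return self.salary


class Manager(Employee):
    """Hypothetical derived class that adds a bonus."""

    def __init__(self, name, salary, bonus):
        super().__init__(name, salary)  # reuse the base-class initialization
        self.bonus = bonus

    def earnings(self):
        """Override: extend the base-class result instead of rewriting it."""
        return super().earnings() + self.bonus


staff = [Employee('Pat', 1000), Manager('Sue', 1000, 200)]
for person in staff:
    # The same call runs a different method depending on the object's class.
    print(person.name, person.earnings())
```

Note that a `Manager` object is also an `Employee`, so it can be used anywhere client code expects the base class.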

# Topic 2: Natural Language Processing (Sessions 24-27)
* Natural language processing is one of the most crucial fields of study across several domains of expertise
    * Machine learning experts seek to garner information from and classify text and conversations into various categories.  
    * AI and Robotics experts seek to emulate human speech and interactions through a machine interface. 
    * Linguists seek to understand the morphology and structure of communication across languages, cultures, and media. 
* Natural language processing is performed in a variety of applications, and across a variety of media.
* In general, we refer to collections of text and related media processed through natural language processing as `corpora`.
    * Corpora include novels, tweets, Facebook posts, movie reviews, etc.
* **Nuances of meaning** make natural language understanding _incredibly challenging_.

## `textblob` and `TextBlob`
* `textblob` is one of the most popular and approachable libraries for NLP in Python.
    * It is built primarily on the combined efforts behind the `nltk` (Natural Language Toolkit) and `pattern` (Pattern) libraries
* The `TextBlob` is the standard class available within the `textblob` library intended to address most or all NLP tasks.  
    * It is best conceptualized as a Python `string` object _plus_ attributes, methods, and properties that enable NLP functionality.
* We addressed a number of tasks available through `textblob`, but bear in mind we also looked at some other NLP tasks without `textblob` involvement (readability scores via `Textatistic`, etc.)
* **Only the tasks outlined below are likely to be prominent on the final exam.**

### Sentence and Word Tokenization

In [11]:

from textblob import TextBlob
blob=TextBlob('Today is a beautiful day. Tomorrow looks like bad weather.')
print(blob.sentences) #sentence tokenization: returns a list of the sentences in the TextBlob
print(blob.words) #word tokenization

[Sentence("Today is a beautiful day."), Sentence("Tomorrow looks like bad weather.")]
['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks', 'like', 'bad', 'weather']


### Parts-of-Speech Tagging


In [12]:
blob = TextBlob('Today is a beautiful day. Tomorrow looks like bad weather.')
blob.tags

[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]

### Extracting Noun Phrases

In [15]:
blob = TextBlob('Today is a beautiful day. Tomorrow looks like bad weather.')
print(blob)
blob.noun_phrases

Today is a beautiful day. Tomorrow looks like bad weather.


WordList(['beautiful day', 'tomorrow', 'bad weather'])

## Sentiment Analysis with TextBlob’s Default Sentiment Analyzer
* Extremely important NLP task with wide application
* Also **one of the hardest NLP tasks** due to the combined factors of textual context, things not conveyed in text (sarcasm, tone, etc.), and the differing nature of language across platforms.
* TextBlob sentiment analysis returns two components based on the composition of underlying words:
    * Polarity: -1 (really bad)  to 1 (really good)
    * Subjectivity: 0 (completely objective) to 1 (completely subjective)

In [20]:
blob1 =  TextBlob('The main course at the restaurant was absolutely terrible.')
blob1.sentiment

Sentiment(polarity=-0.4166666666666667, subjectivity=0.6666666666666666)

In [17]:
blob2 =  TextBlob('But the dessert after the meal was delicious.')
blob2.sentiment

Sentiment(polarity=1.0, subjectivity=1.0)

In [19]:
blob3 =  TextBlob('It is 10:15 AM.')
blob3.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)
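
A polarity score is often reduced to a coarse label before further analysis. The sketch below is one possible way to do that; the function name `label_polarity` and the `0.1` threshold are assumptions chosen for illustration and are not part of `textblob`.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity in [-1, 1] to a coarse label.

    The threshold is an arbitrary illustrative cutoff, not a TextBlob setting.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# The three polarity values shown in the cells above
for score in (-0.4166666666666667, 1.0, 0.0):
    print(f'{score:+.2f} -> {label_polarity(score)}')
```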

### Language Detection and Translation


In [21]:
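blob=TextBlob('Today is a beautiful day. It looks like tomorrow we are in for bad weather.')
# Note: detect_language() and translate() call an online translation service.
# They were deprecated in TextBlob 0.16 and removed in 0.18, so these cells need an older release.
blob.detect_language()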

'en'

In [22]:
blobes= blob.translate(to='es')
print(blobes)
blobes.detect_language()

Hoy es un hermoso dia. Parece que mañana nos espera mal tiempo.


'es'

In [23]:
print(blobes.translate(to='en'))

Today is a beautiful day. It seems that tomorrow bad weather awaits us.


## Spell Checking and Correction
* Often unreliable due to not taking context into account

In [24]:
from textblob import Word
word = Word('theyr')

In [25]:
word.spellcheck()

[('they', 0.5713042216741622), ('their', 0.42869577832583783)]

In [26]:
word.correct()  # chooses word with the highest confidence value

'they'

## Normalization: Stemming and Lemmatization
* **Stemming** removes a **prefix** or **suffix** from a word leaving only a **stem**, which **may or may not be a real word**
* **Lemmatization** is similar, but factors in the word’s **part of speech** and **meaning** and results in a **real word**
* Both **normalize** words for analysis
    * Before calculating statistics on words in a body of text, we might convert all words to lowercase so that capitalized and lowercase words are not treated differently. 
* We might want to use a word’s root to represent the word’s many forms. 
    * E.g., treat "program" and "programs" as "program"

In [27]:
word = Word('varieties')
word.stem()

'varieti'

In [28]:
word.lemmatize()

'variety'

## Word Frequencies
* Word Frequencies are vital for a number of practical applications
    * For example, machine-learning methods for computing **similarity between documents** rely on **word frequencies**
    * Word frequencies can also be used in trend analysis.
* Keep in mind that some of the visualizations we performed (word cloud, etc.) are fair game for the programming portion.
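
For intuition about what a word-frequency table contains, here is a minimal sketch that uses only the standard library rather than `textblob`. The sample sentence and the simple regular-expression tokenizer are assumptions made for illustration; `TextBlob` tokenizes more carefully.

```python
import re
from collections import Counter

# Made-up sample text; any string would work here.
text = 'The Rabbit ran. The rabbit was late, and Alice followed the rabbit.'

# Lowercase first so 'Rabbit' and 'rabbit' are counted together,
# then pull out runs of letters/apostrophes as a crude tokenizer.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

print(counts['rabbit'])  # 3
print(counts['alice'])   # 1
print(counts.most_common(2))  # the two most frequent words with their counts
```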

In [29]:
from pathlib import Path
#Note: we are reading the text into a string then using it to construct a blob
text = Path('AliceInWonderland.txt').read_text(encoding='UTF8') 
blob = TextBlob(text) 

In [30]:
print(blob.word_counts['alice'])
print(blob.word_counts['rabbit'])
print(blob.word_counts['tea-party'])

404
48
3


## Stop Word Removal
* Many common words are largely unimportant for most machine-learning tasks.
* We saw how the `nltk` library could be used to remove stop words from text. 

In [31]:
from nltk.corpus import stopwords
stops = stopwords.words('english') #We're looking specifically for English stop words
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [32]:
blob=TextBlob('Today is a beautiful day. It looks like tomorrow we are in for bad weather.')
# Note: the comparison is case-sensitive and the stop list is lowercase, so 'It' is kept below
usablewords = [word for word in blob.words if word not in stops]
print(usablewords)

['Today', 'beautiful', 'day', 'It', 'looks', 'like', 'tomorrow', 'bad', 'weather']
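
Since stop lists are typically lowercase, a case-insensitive comparison is often what is wanted. The sketch below shows the idea with a small hand-written stop set standing in for the full `nltk` list (an assumption, so that the example runs without downloading any corpus).

```python
# Small illustrative stop set; the real nltk list is much longer.
stops = {'is', 'a', 'it', 'we', 'are', 'in', 'for'}

words = ['Today', 'is', 'a', 'beautiful', 'day', 'It', 'looks', 'like',
         'tomorrow', 'we', 'are', 'in', 'for', 'bad', 'weather']

# Lowercase each word only for the membership test, so the
# original capitalization is preserved in the result.
usable = [word for word in words if word.lower() not in stops]
print(usable)
```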


# Readability
* Readability is a measure of how easy a piece of text is to read, and can tell you what level of education someone will need to be able to read a piece of text easily. 

* We looked at `textatistic` for establishing readability scores, which is extremely easy to use, but not always reliable.


In [33]:
#Create Textatistic object for the text
from textatistic import Textatistic
readability = Textatistic(text)
%precision 3
readability.dict()

{'char_count': 133198,
 'word_count': 30036,
 'sent_count': 1837,
 'sybl_count': 36343,
 'notdalechall_count': 6107,
 'polysyblword_count': 999,
 'flesch_score': 87.875,
 'fleschkincaid_score': 5.065,
 'gunningfog_score': 7.871,
 'smog_score': 7.342,
 'dalechall_score': 7.658}

# Named Entity Recognition and spaCy
* **Named entity recognition** attempts to locate and categorize items and keywords of particular significance. 
    * **dates**, **times**, **quantities**, **places**, **people**, **things**, **organizations** and more can serve as such categories
* Named entity recognition can serve as a basis for sophisticated machine learning and data mining applications.
* We used `spaCy` in class to perform named entity recognition.

In [34]:
import spacy
nlp_lg = spacy.load('en_core_web_lg') 
spadoc = nlp_lg('In 1994, Tim Berners-Lee of the United Kingdom ' + 
    'founded the WWW Consortium (W3C), devoted to ' +
    'developing online technologies in 1994')
for entity in spadoc.ents:
    print(f'{entity.text}: {entity.label_}')

1994: DATE
Tim Berners-Lee: PERSON
the United Kingdom: GPE
the WWW Consortium: ORG
1994: DATE


# Similarity Detection with spaCy
* Documents are often gauged in terms of their relative **similarity** based on a variety of measures.
* We used spaCy to perform (somewhat unreliable) document similarity tests.
* Note that more reliable similarity detection methods are available, but tend to be relatively specific to different text types.
* A similarity of 1 indicates the documents are extremely similar, while 0 indicates they aren't similar at all.

In [35]:
#Create the docs for the two pieces
from pathlib import Path
document1 = nlp_lg(Path('AliceInWonderland.txt').read_text(encoding='UTF8'))
document2 = nlp_lg(Path('EdwardTheSecond.txt').read_text(encoding='UTF8'))
docsim=document1.similarity(document2)
print(f'AIW to ET2nd similarity is {docsim:.3}.')

AIW to ET2nd similarity is 0.949.
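
To get a feel for what a similarity score between 0 and 1 means, the sketch below computes cosine similarity over simple word counts. This is a deliberately simplified stand-in: spaCy compares word vectors rather than raw counts, so the numbers will not match `document1.similarity(document2)`. The function name and sample sentences are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using bag-of-words counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words of text_a (missing words count as 0).
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity('the cat sat', 'the cat sat'))       # identical wording: about 1
print(cosine_similarity('the cat sat', 'dogs bark loudly'))  # no shared words: 0.0
print(cosine_similarity('the cat sat', 'the cat ran'))       # partial overlap: in between
```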


# Topic 3: Tweet Mining and Geocoding (Sessions 30-34)
* Twitter is a popular big-data source for making predictions.
    * Fortune 500 companies often use Twitter as a megaphone for PR purposes or advertisement for new/modified technology.
    * Individuals use Twitter to voice satisfaction or dissatisfaction with particular people, companies, or technology.
    * The degree to which a subject or event trends provides a potential indication of its widespread impact.
    * **Sentiment** in tweets plays an enormous factor in how people gauge public response to current events, actions, and politics. 
* Geocoding is the process of transforming an address, description of a location, or landmark/location name, into a real location on the surface of the earth (usually in longitude/latitude format)
* Geocoding can help isolate sentiment and other specifics in tweets within particular regions.

## Note: You will NOT have to connect to the Twitter API for your final in any format. 
* **However**, you will be expected to understand the basics of how the services we utilized work, as well as the Python libraries we utilized to connect to the API.
* As well, you may be expected to load pre-saved tweets and perform some sort of analysis/visualization on them during the programming portion of the exam.

# What’s in a Tweet? 
* Twitter API methods return **JSON (JavaScript Object Notation)** objects
* JSON is a widely used text-based **data-interchange format** 
* Objects are represented as **collections of name–value pairs** (like dictionaries)
* Commonly used in web services, as it is both human and computer readable.
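
The correspondence between JSON objects and Python dictionaries can be sketched with the standard `json` module. The tweet below is made up for illustration and contains only a few of the fields a real tweet carries.

```python
import json

# A made-up, heavily abbreviated tweet as JSON text.
raw = ('{"text": "I got COVID", '
       '"user": {"screen_name": "example_user"}, '
       '"retweet_count": 0}')

tweet = json.loads(raw)  # JSON text -> Python dict
print(tweet['text'])
print(tweet['user']['screen_name'])  # nested objects become nested dicts

# Python dict -> JSON text (indent makes it easier for people to read)
pretty = json.dumps(tweet, indent=2)
print(pretty)
```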

In [36]:
import json
with open('savedtweets.json') as json_file:
    tweets_from_json = json.load(json_file)
for tweet in tweets_from_json:
    print(tweet['text'])
    print('')


A Salute to China's COVID Heroes https://t.co/RxSta4MDci @MartyMakary #tcot

BBC News - Covid: Under-30s offered alternative to AstraZeneca jab
https://t.co/YTqeHppsq1

@winchesthearts @jacksbees I firmly believe the usual makeup artist wasn't there because of covid &amp; they got an 18… https://t.co/96fnhftoRG

L.A.’s young and healthy head to Bakersfield for COVID-19 vaccine https://t.co/SHOQzxlpvU

Today is #WorldHealthDay 🌍

🔴#COVIDー19 crisis negatively impacts mental health
🔴#LGBTI persons are disproportionate… https://t.co/d0pNAqgqyr

#breaking #breakingnews
Covid: Under-30s offered alternative to AstraZeneca jab https://t.co/sI1YILqVbD

AstraZeneca Covid vaccine possibly linked to blood clot events: EU drug regulato https://t.co/LQFnb1XbeS

I got COVID https://t.co/jpdBiSJbTE

IM FINALLY COVID FREE I TESTED NEGATIVE TODAY AFTER 11 DAYS🎉

This FEMA clinic in Norfolk says it's taking walk-in COVID vaccinations. 

For anyone 16 years or older. If you nee… https://t.co/7t5Z38xeMV

O

# Cleaning/Preprocessing Tweets for Analysis
* The **tweet-preprocessor** library can be used to automatically remove a variety of tweet components and special characters (emojis, URLs, hashtags, etc.)
* By default, all available elements are removed; `set_options` can be used to restrict cleaning to specific element types.

In [37]:
import preprocessor as p
#p.set_options(p.OPT.URL, p.OPT.RESERVED)
for tweet in tweets_from_json:
    print(p.clean(tweet['text']))
    print('')

A Salute to China's COVID Heroes

BBC News - Covid: Under-30s offered alternative to AstraZeneca jab

I firmly believe the usual makeup artist wasn't there because of covid &amp; they got an

L.A.s young and healthy head to Bakersfield for COVID-19 vaccine

Today is crisis negatively impacts mental health persons are disproportionate

: Under-30s offered alternative to AstraZeneca jab

AstraZeneca Covid vaccine possibly linked to blood clot events: EU drug regulato

I got COVID

IM FINALLY COVID FREE I TESTED NEGATIVE TODAY AFTER DAYS

This FEMA clinic in Norfolk says it's taking walk-in COVID vaccinations. For anyone years or older. If you nee

Ontario Covid Update

/- was pre-COVID alao?!

You should have controlled covid. Still people are gathering in big rally. How stupid is that. Doesnt ma

Im waiting to see how a friend who did get Covid, but wasnt hospitalized, gets on.

Grateful for all our is doing for the American people in the fight against We need to cooperate to th

Covid:

# Initializing a Credentialed API
* Note that while you won't have to perform this task during the exam, you should know how it works.

In [38]:
import os
import tweepy
auth = tweepy.OAuthHandler(os.environ['APIK'],
                           os.environ['APISK']) #Set up the app keys
auth.set_access_token(os.environ['ACCTO'],
                      os.environ['ACCTOS']) #set up your user tokens
api = tweepy.API(auth, wait_on_rate_limit=True, 
                 wait_on_rate_limit_notify=True) #ALWAYS use wait_on_rate_limit to prevent yourself from getting blocked

# Getting Information About a Twitter Account
* `API` object’s **`get_user` method** returns a **`tweepy.models.User` object** containing information about a specific user’s Twitter account

In [39]:
uofl = api.get_user('uofl')
print(uofl.screen_name)
print(uofl.id) 
print(uofl.status.text)

uofl
39566272
Yes, that is snow greeting us on this last day of spring semester classes (and less than two weeks from the Derby).… https://t.co/KwCbAYC7GM


## Tweepy Cursors: Retrieving Collections of Objects  
* Tweepy cursors are required when retrieving **multi-page** results (i.e. each individual API call returns a _subset_ of available results)
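
The idea behind a cursor can be sketched without touching the Twitter API. In the block below, `fetch_page` is a hypothetical stand-in for a single paged API call, and `iterate_items` plays the role a cursor plays: it keeps requesting pages until enough items have been produced. This is a conceptual sketch, not how Tweepy is implemented.

```python
def fetch_page(cursor, page_size=3):
    """Hypothetical stand-in for one API call: returns (items, next_cursor)."""
    data = [f'user{i}' for i in range(8)]  # pretend the server holds 8 followers
    page = data[cursor:cursor + page_size]
    has_more = cursor + page_size < len(data)
    next_cursor = cursor + page_size if has_more else None
    return page, next_cursor


def iterate_items(limit):
    """Yield up to `limit` items, requesting new pages only as needed."""
    cursor = 0
    yielded = 0
    while cursor is not None and yielded < limit:
        page, cursor = fetch_page(cursor)
        for item in page:
            if yielded == limit:
                return
            yield item
            yielded += 1


print(list(iterate_items(5)))    # stops partway through the second page
print(list(iterate_items(100)))  # stops when the server runs out of pages
```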


In [40]:
cursor = tweepy.Cursor(api.followers, screen_name='uofl', count=200)
followers = []
for account in cursor.items(400):  # request 2 pages of followers
    followers.append(account.screen_name)


In [41]:
print('Followers:', 
      ' '.join(sorted(followers, key=lambda s: s.lower())))

Followers: 009HALF51 3bhady99 502_matthew_ _______mkultra _lukeshope_ _prettyboy23 _Symbiont_ _teddybearrr_ A_Will28 AadenSean AcademicTimes ACE4923 adept_knight ahmed11ali AirleaW airzbychu8 ajhanainewton Ajo1luv AKSingh31068111 alejandrayuleem AlexandraHopeM AlexBeta16 AlexWilson65 alyssah83854847 ambermdoss Amy43844355 AmyLawyer anasMEJI AndrewTBonasera AnesehAlvanpour Anthony14664936 Artemis01491079 auggiedoggie011 AussiesAreFun AustinVerity1 ayannamonte Azygos1000 b_shearin BClark99536365 BeeShah_ BennyJ_Allen bfarias Bigz723 billyth77982628 bjbuffin boys_laundry bratta_tammy BRZpack BU2FUL88 CadeChristens12 cam_rene27 Cameron14369530 camrynfosterr CandyMa60281917 cantonious CAR_RCC_Journal caraline_new cardinal1987 CardinalReality CardinalRn21 CardsByRon cardsjosh37 CardsTyler caredotai CarinaOnAir CarolineMcMurry cathyf22 catwoman92 CHARLES_T79 chemicalmafia88 cheyennegvowell Chrisgioia3 ChrisLCarp Christo37677588 CitizensHS CJack55850201 clairechaussee9 clfrench73 cluettLaw Coa

## Alternate Approach: Getting Follower IDs Rather Than Followers
* Can get **many more Twitter IDs** by calling the **`followers_ids`** method instead of `followers`
* In conjunction with the **`lookup_users`** method, we can retrieve information on many more (10x+) followers than with the `followers` method. 
* _This serves as an important example to emphasize there are multiple approaches to obtaining desired information using the Twitter API, and not all are equally efficient._

## Getting a User’s Recent Tweets 
* We can use the `API` object's **`user_timeline`** method to obtain a user's recent tweets.

In [42]:
# Note: should use Cursor if getting more than max tweets per call
uofl_tweets = api.user_timeline(screen_name='uofl', count=5)
for tweet in uofl_tweets:
    print(f'{tweet.user.screen_name}: {tweet.text}\n')

uofl: Yes, that is snow greeting us on this last day of spring semester classes (and less than two weeks from the Derby).… https://t.co/KwCbAYC7GM

uofl: There’s nothing quite like spring on campus ❤️ https://t.co/6BC63M6dBO

uofl: RT @GoCards: ✔️ Student-Athletes
✔️ Coaches
✔️ Louie

Everyone is doing their part &amp; getting vaccinated at Cardinal Stadium thanks to the a…

uofl: The @UofLbiz Project on Positive Leadership has launched a smartphone app to help new and experienced leaders hone… https://t.co/ZPmOLCxWHD

uofl: @shessolouky @dmlang04 Beautiful day to flash some Ls!



# Searching for Recent Tweets with the Search API
* A single search call can return up to 100 recent tweets from the previous seven days
* A query string is almost always used to look for specific keywords or other elements

In [43]:
tweets = api.search(q='Software', count=10)
for tweet in tweets:
    print(f'[{tweet.user.screen_name}:', end=' ')
    print(f'{tweet.text}]')
    print('')


[Tricentis: We're excited to announce the hire of @KThompsonSWI as Tricentis CEO and chairman of the board! Kevin is a seasoned… https://t.co/JfjlbAHvrK]

[TheIP100: Congrats to one of our 2019 top-20 entrants, @Candidate_ID, who have secured £1.3m of funding! This Glasgow-based s… https://t.co/duDq2PegHo]

[andyfusion1: Does anybody know of any Senior Software Engineers looking for a new opportunity at the moment? Remote with 1 day p… https://t.co/hvssMGTiaw]

[scottbw: @erikkain I work in Open Source and increasingly have to deal with folks getting upset that the freedom to use soft… https://t.co/WwTJcueqyj]

[pitchprint: New Feature Alert ‼️
#activityhistory #PrintIndustry #software #web2print https://t.co/RzlQghk2xZ]

[Avaaz: ⚠️Attention, Apple users! 🍎  
Any day now, Apple’s new software update will allow users to opt out of cross-app tra… https://t.co/kYLYpKZCZZ]

[Brainlly1: RT @CionekJackson: Neurogames: Using Photon Network to build Multiplayer Serius Games applied to Scientif

# Spotting Trends with Twitter Trends API
* We can use the `trends_place` method to locate regional trends according to a WOEID
    * The WOEID represents a code for the region of interest (1=worldwide)

In [44]:
lou_trends_comp = api.trends_place(id=2442327)
lou_trends = lou_trends_comp[0]['trends']
# keep only trends that report a tweet_volume (None is filtered out; the volumes reported below are all above 10k)
lou_trends_baseline = [t for t in lou_trends if t['tweet_volume']]


In [45]:
from operator import itemgetter 
lou_trends_baseline.sort(key=itemgetter('tweet_volume'), reverse=True) 
for trend in lou_trends_baseline:
    # double quotes outside so the single-quoted keys can be used inside the f-string
    print(f"{trend['name']:>20} {trend['tweet_volume']:>10}")



               Knife     339982
                Ohio     337401
               Mario     205094
       #makhiabryant     153912
          FOUR TIMES     102067
                STAB      87648
      Tucker Carlson      80086
   Blue Lives Matter      46342
               Derby      39526
       Stacey Abrams      27948
   #wednesdaythought      27023
    Kyle Rittenhouse      26564
          All Access      23090
  Justice Department      23052
                ASHE      18135
      Good Wednesday      17016
Minneapolis Police Department      15788
            Hump Day      14784
   Nobel Peace Prize      12508
            Kentucky      12369
       Mother Nature      12185
             McEnany      11652
              Tasers      10495
          AG Garland      10335


## Other Twitter Concerns
* We created a derived version of the `TweetListener` class (`SentimentListener`) to monitor for streaming tweets on a particular issue, and measure the associated sentiment of each tweet. 
* We used the **`geopy`** library and a web service to find latitude and longitude for tweets.
    * These locations were then plotted on a map using `folium`.

# Topic 4: Machine Learning (Sessions 35-38)
* We observed several machine learning tasks in action on a variety of datasets.
* We performed several visualizations for the associated tasks.
* We looked at concerns regarding training and testing (overfitting, cross-validation testing etc.) along the way. 

## ML Task 1: Classification

* **Classification algorithms predict the discrete classes (categories) to which samples belong.**
    * Classification is the first of two primary **supervised learning** tasks.
    * **Binary classification** uses two classes, such as “spam” or “not spam” in an email classification application. 
    * **Multi-class classification** uses more than two classes, such as the 10 classes, 0 through 9, in the Digits dataset.     
* We used the **Digits dataset** bundled with **scikit-learn** to explore classification. 
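
To make the classification workflow concrete without requiring scikit-learn, the sketch below hand-codes one simple classifier, k-nearest neighbors, on a tiny made-up dataset. The choice of k-nearest neighbors, the data points, and the labels are all assumptions for illustration; in practice you would use a scikit-learn estimator on the Digits data.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k closest training points."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated groups of 2-D points.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ['small', 'small', 'small', 'large', 'large', 'large']

# Held-out samples the classifier has not seen, with their true labels.
test_points = [(1.1, 1.0), (5.1, 5.0)]
test_labels = ['small', 'large']

predictions = [knn_predict(train_points, train_labels, p) for p in test_points]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Keeping the test samples separate from the training samples is what allows the accuracy to say something about unseen data.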


## ML Task 2: Regression
* **Regression problems** involve prediction of a **continuous output** target variable as a function of other variables.
* Regression is the second of the two primary supervised learning tasks.
* This leads to a conceptually and computationally distinct challenge in comparison to classification problems.
* Broadly speaking, regression approaches can be broken apart into **linear regression** and **non-linear regression** analyses.
* For the purposes of this class, we solely focused on linear regression through the assistance of `sklearn`'s **`LinearRegression` estimator**.
    * We performed **simple linear regression** (tying a single independent variable to the target variable) on the January temperature dataset.  
    * We then performed **multiple linear regression** to tie multiple independent variables from the California dataset to median house value.
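
For the single-variable case, the line that ordinary least squares fits can be computed by hand, which may help when interpreting a fitted slope and intercept. The sketch below does this in plain Python; the function name `fit_line` and the year/temperature values are made up for illustration and are not the class dataset.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on a line that rises 0.2 degrees per year.
years = [2000, 2005, 2010, 2015]
temps = [30.0, 31.0, 32.0, 33.0]

slope, intercept = fit_line(years, temps)
prediction_2020 = slope * 2020 + intercept
print(f'slope={slope:.3f}, prediction for 2020={prediction_2020:.1f}')
```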

## ML Task 3: Unsupervised Learning
* We used `t-SNE` conversion to visualize a large scale dataset (the digits dataset with 64 dimensions) in 2 dimensions.
    * We further explored how well the visualization breaks apart the different digits. 
* We then performed `K-means` clustering on the **Iris** dataset.
    * We focused on determining which Iris species our unsupervised clustering could successfully separate.
* **Note: You can expect similar visualization requirements on the programming portion of the exam!**
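
The core loop of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain Python. This is a simplified illustration with made-up points and fixed starting centroids, not scikit-learn's implementation, which also handles initialization and convergence checks.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds; return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere in this sketch, which is what makes the task unsupervised; any comparison against true species labels happens only afterwards.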