# Text Classification using scikit-learn in NLP
- from [GeeksForGeeks Text Classification using scikit-learn in NLP](https://www.geeksforgeeks.org/nlp/text-classification-using-scikit-learn-in-nlp/)
- for *reasons* I need to figure out how I *would* have categorized different types of vehicle services based on the description.

## Text Classification
- assign predefined categories or labels to text documents
- involves automated sorting and organizing of textual data
- extract valuable information and insights from large volumes of text
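
To make "assign predefined labels to text" concrete, here is a minimal sketch on made-up vehicle service descriptions. The training texts, the `oil`/`tires` labels, and the choice of TF-IDF features with a Naive Bayes classifier are all assumptions for illustration; this is one common scikit-learn setup, not necessarily the one the article ends up using.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: service descriptions with predefined labels
train_texts = [
    "oil change and filter replacement",
    "replace engine oil and oil filter",
    "rotate tires and check tire pressure",
    "replace worn tires and balance wheels",
]
train_labels = ["oil", "oil", "tires", "tires"]

# TF-IDF turns each description into a vector of weighted word counts;
# Naive Bayes then learns which words tend to go with which label
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classify descriptions the model has not seen before
predictions = model.predict(["oil filter change", "tire pressure check"])
print(list(predictions))
```

Four training examples are far too few for real use; the point is only the shape of the workflow: labeled text in, fitted model, labels out for new text.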

## Import Dataset
- fetch the data from the `20 newsgroups` dataset from `sklearn`
- it's a collection of documents from 20 different newsgroups
- specifically grab documents about baseball and space
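- the first call takes a while: it downloads the whole archive (about 14 MB, 18,846 posts across all 20 groups) and caches it locally, then filters down to the two categories; later calls read from the cache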

In [21]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

In [22]:
# fetch the dataset of news documents
newsgroups = fetch_20newsgroups(
    subset='all', # get both the test and train datasets
    categories=[
        'rec.sport.baseball', # get baseball sport documents
        'sci.space', # get science space documents
    ], 
    shuffle=True, # shuffle the data
    random_state=42 # shuffle in a predictable order
)
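
A quick way to check how much data came back is to count documents overall and per category. The sketch below wraps that in a hypothetical helper, `summarize`, and runs it on a made-up stand-in object so it works without downloading anything; the stand-in only mimics the `data`, `target`, and `target_names` attributes of what `fetch_20newsgroups` returns.

```python
import numpy as np
from sklearn.utils import Bunch


def summarize(bunch):
    """Return the document count and the number of documents per category."""
    counts = np.bincount(bunch.target, minlength=len(bunch.target_names))
    per_category = {name: int(n) for name, n in zip(bunch.target_names, counts)}
    return len(bunch.data), per_category


# Stand-in with the same attributes as the fetched dataset (made-up documents)
toy = Bunch(
    data=["pitcher threw a shutout", "orbit insertion burn", "home run in the ninth"],
    target=np.array([0, 1, 0]),
    target_names=["rec.sport.baseball", "sci.space"],
)

total, per_category = summarize(toy)
print(total, per_category)
```

Calling `summarize(newsgroups)` on the real object should report the total number of fetched posts and how they split between the two categories.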

In [None]:
# split off the data (text) from the target (label)
data = newsgroups.data
target = newsgroups.target

# Create a DataFrame for easy manipulation
df = pd.DataFrame({'text': data, 'label': target})
df

| | text | label |
|---|---|---|
| 0 | From: mss@netcom.com (Mark Singer)\nSubject: R... | 0 |
| 1 | From: cuz@chaos.cs.brandeis.edu (Cousin It)\nS... | 0 |
| 2 | From: J019800@LMSC5.IS.LMSC.LOCKHEED.COM\nSubj... | 0 |
| 3 | From: tedward@cs.cornell.edu (Edward [Ted] Fis... | 0 |
| 4 | From: snichols@adobe.com (Sherri Nichols)\nSub... | 0 |
| ... | ... | ... |
| 1976 | From: msb@sq.sq.com (Mark Brader)\nSubject: Re... | 1 |
| 1977 | From: clgs11@vaxa.strath.ac.uk\nSubject: Jack ... | 0 |
| 1978 | From: 18084TM@msu.edu (Tom)\nSubject: Level 5?... | 1 |
| 1979 | From: snichols@adobe.com (Sherri Nichols)\nSub... | 0 |


### explore the data
- it's got 1981 rows (indices 0 through 1980)
- the articles are shuffled together
- a label of `0` indicates that it's about `baseball`
- a label of `1` indicates that it's about `space`
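
The label meanings do not have to be taken on faith: the dataset object carries a `target_names` list, and the position of each name in that list is its numeric label. A minimal sketch of attaching readable names to the DataFrame, using made-up stand-in rows (with the real data, `target_names` would come from `newsgroups.target_names`):

```python
import pandas as pd

# Stand-ins for newsgroups.target_names and the DataFrame of posts
target_names = ["rec.sport.baseball", "sci.space"]
df = pd.DataFrame({
    "text": ["made-up baseball post", "made-up space post", "another space post"],
    "label": [0, 1, 1],
})

# The index of each name in target_names is its numeric label
label_to_name = dict(enumerate(target_names))
df["label_name"] = df["label"].map(label_to_name)

# Count how many posts fall under each readable label
print(df["label_name"].value_counts().to_dict())
```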

In [26]:
print("articles labeled 0 are about baseball")
print("label:", df.iloc[0]["label"])
print("text:", df.iloc[0]["text"])

articles labeled 0 are about baseball
label: 0
text: From: mss@netcom.com (Mark Singer)
Subject: Re: Young Catchers
Article-I.D.: netcom.mssC52qMx.768
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Lines: 86

In article <7975@blue.cis.pitt.edu> genetic+@pitt.edu (David M. Tate) writes:
>mss@netcom.com (Mark Singer) said:
>>
>>We know that very, very few players at this age make much of an impact
>>in the bigs, especially when they haven't even played AAA ball.  
>
>Yes.  But this is *irrelevant*.  You're talking about averages, when we
>have lots of information about THIS PLAYER IN PARTICULAR to base our
>decisions on.

Do you really have *that* much information on him?  Really?

>Why isn't Lopez likely to hit that well?  He hit that well last year (after
>adjusting his stats for park and league and such); he hit better (on an
>absolute scale) than Olson or Berryhill did.  By a lot.

I don't know.  You tell me.  What percentage of players reach or 
excee

In [None]:
print("articles labeled 1 are about space")
print("label:", df.iloc[1980]["label"])
print("text:", df.iloc[1980]["text"])

articles labeled 1 are about space
label: 1
text: From: mancus@sweetpea.jsc.nasa.gov (Keith Mancus)
Subject: Re: Lindbergh and the moon (was:Why not give $1G)
Organization: MDSSC
Lines: 32

In article <1r3nuvINNjep@lynx.unm.edu>, cook@varmit.mdc.com (Layne Cook) writes:
> All of this talk about a COMMERCIAL space race (i.e. $1G to the first 1-year 
> moon base) is intriguing. Similar prizes have influenced aerospace 
> development before. The $25k Orteig prize helped Lindbergh sell his Spirit of 
> Saint Louis venture to his financial backers.
> But I strongly suspect that his Saint Louis backers had the foresight to 
> realize that much more was at stake than $25,000.
> Could it work with the moon? Who are the far-sighted financial backers of 
> today?

  The commercial uses of a transportation system between already-settled-
and-civilized areas are obvious.  Spaceflight is NOT in this position.
The correct analogy is not with aviation of the '30's, but the long
transocean voyages of 
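
Both sample posts start with a header block (`From:`, `Subject:`, `Organization:`, ...) followed by a blank line and then the body. A classifier could latch onto header fields rather than the writing itself, so it can be worth looking at the body alone. The sketch below uses a hypothetical helper, `strip_header`, on a shortened made-up article; `fetch_20newsgroups` also accepts `remove=("headers", "footers", "quotes")` to do this kind of cleanup at load time.

```python
def strip_header(text):
    """Drop everything up to and including the first blank line."""
    _header, _blank, body = text.partition("\n\n")
    return body


# Shortened stand-in shaped like the posts printed above
article = (
    "From: someone@example.com (Some One)\n"
    "Subject: Re: Young Catchers\n"
    "Lines: 86\n"
    "\n"
    "Do you really have that much information on him?"
)

print(strip_header(article))
```

Note that this simple version returns an empty string when a post contains no blank line at all.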