# 20 Newsgroups data import script for *Google Cloud AutoML Natural Language*

This notebook downloads the [20 newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) using scikit-learn. The dataset contains about 18,000 posts from 20 newsgroups and is useful for text classification. The notebook loads the data into a pandas DataFrame, cleans it, and writes a CSV file readable by [Google Cloud AutoML Natural Language](https://cloud.google.com/natural-language/automl).

## Imports

In [0]:
import numpy as np
import pandas as pd
import csv

from sklearn.datasets import fetch_20newsgroups

## Fetch data

In [2]:
newsgroups = fetch_20newsgroups(subset='all')

df = pd.DataFrame(newsgroups.data, columns=['text'])
df['categories'] = [newsgroups.target_names[index] for index in newsgroups.target]
df.head()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


| | text | categories |
|---|---|---|
| 0 | From: Mamatha Devineni Ratnam <mr47+@andrew.cm... | rec.sport.hockey |
| 1 | From: mblawson@midway.ecn.uoknor.edu (Matthew ... | comp.sys.ibm.pc.hardware |
| 2 | From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject... | talk.politics.mideast |
| 3 | From: guyd@austin.ibm.com (Guy Dawson)\nSubjec... | comp.sys.ibm.pc.hardware |
| 4 | From: Alexander Samuel McDiarmid <am2o+@andrew... | comp.sys.mac.hardware |
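
The `categories` column is built by looking up each integer in `newsgroups.target` in the list `newsgroups.target_names`. A minimal sketch of that lookup, using made-up stand-in values rather than the real dataset:

```python
# Hypothetical stand-ins for newsgroups.target_names and newsgroups.target
target_names = ['rec.autos', 'sci.space', 'talk.politics.guns']
target = [1, 0, 1, 2]

# Same lookup as in the cell above: integer index -> label string
categories = [target_names[index] for index in target]
print(categories)
```

Each post ends up with a human-readable label string instead of an integer index.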


## Clean data

In [3]:
# Collapse runs of whitespace (including newlines) into a single space
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

# Trim leading and trailing whitespace
df['text'] = df['text'].str.strip()

# Truncate all fields to the maximum field length of 128 kB
# (slice counts characters, so this keeps at most 131,072 characters)
df['text'] = df['text'].str.slice(0, 131072)

# Remove any rows with empty fields
df = df.replace('', np.nan).dropna()

# Drop duplicate texts, keeping the first occurrence
df = df.drop_duplicates(subset='text')

# Shuffle the rows and keep at most 100,000 of them
df = df.sample(min(100000, len(df)))

df.head()

| | text | categories |
|---|---|---|
| 16550 | From: will@futon.webo.dg.com (Will Taber) Subj... | soc.religion.christian |
| 13121 | From: zowie@daedalus.stanford.edu (Craig "Powd... | sci.space |
| 15519 | From: boyle@cactus.org (Craig Boyle) Subject: ... | rec.autos |
| 10389 | From: erics@netcom.com (Eric Smith) Subject: R... | rec.sport.baseball |
| 15554 | Organization: University of Illinois at Chicag... | talk.politics.guns |
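
To see what the cleaning steps do to individual rows, here is a rough plain-Python sketch of the same sequence (collapse whitespace, trim, truncate, drop empties, drop duplicates) applied to a few made-up rows. The helper name `clean_rows` and the sample data are hypothetical, the random sampling step is left out, and only the text field is checked for emptiness:

```python
import re

MAX_CHARS = 131072  # same cutoff as the slice in the cell above

def clean_rows(rows):
    """Hypothetical helper: apply the cleaning steps to (text, label) pairs."""
    seen = set()
    cleaned = []
    for text, label in rows:
        # Collapse whitespace runs, trim both ends, then truncate
        text = re.sub(r'\s+', ' ', text).strip()[:MAX_CHARS]
        # Skip texts that are empty or that were already kept
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append((text, label))
    return cleaned

# Made-up rows: one messy post, one blank post, one duplicate after cleaning
rows = [
    ("From: a@example.com\nSubject:  hi\t there ", "sci.space"),
    ("   ", "rec.autos"),
    ("From: a@example.com Subject: hi there", "sci.space"),
]
result = clean_rows(rows)
print(result)
```

The blank row is dropped, and the third row is dropped because it becomes identical to the first once whitespace is normalized.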


## Export to CSV

In [0]:
# Write the text and label columns without an index or header row.
# Writing directly avoids the extra blank line that print() would append.
df.to_csv("20-newsgroups-dataset.csv", index=False, header=False)

You're all set! Download `20-newsgroups-dataset.csv` and import it into [Google Cloud AutoML Natural Language](https://cloud.google.com/natural-language/automl).

If you are using [Google Colab](https://colab.research.google.com), you can find the file through the left navbar:

*   From the menu, select **View > Table of Contents**.
*   Navigate to the **Files** tab.
*   Find the file in the `/content` directory.
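
Before importing the file, a quick sanity check can help. The sketch below assumes the two-column layout produced above (text, then label) and uses only the standard `csv` module. The helper name `check_csv` is hypothetical, and the demo runs on a tiny stand-in file rather than the real export:

```python
import csv
import os
import tempfile

def check_csv(path):
    """Hypothetical helper: return (row_count, set_of_labels) for a two-column CSV."""
    labels = set()
    count = 0
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            # Blank lines come back as empty lists; ignore them
            if not row:
                continue
            # Each remaining row is expected to be [text, label]
            if len(row) != 2:
                raise ValueError(f"row {count + 1} has {len(row)} columns, expected 2")
            labels.add(row[1])
            count += 1
    return count, labels

# Demo on a small temporary file with quoted and unquoted text
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['He said "hello", then left', 'sci.space'])
        writer.writerow(['Plain text', 'rec.autos'])
    count, labels = check_csv(demo_path)

print(count, sorted(labels))
```

To check the real export, call `check_csv("20-newsgroups-dataset.csv")` and compare the label set against the 20 newsgroup names.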


