## Project Exercise: Multi-Modal Data Analysis (Approx. 2-3 Hours)

*Objective*: Apply your data manipulation and visualization skills to analyze and compare features from both a text dataset (20 Newsgroups) and an image dataset (CIFAR-10).

In [1]:
! pip install datasets


In [7]:
from datasets import load_dataset

# 1. Download the dataset
dataset = load_dataset("cifar10")

# 2. Set the format of the 'train' and 'test' splits to 'numpy' and load each split fully into memory
train_data_np = dataset['train'].with_format('numpy')[:]
test_data_np = dataset['test'].with_format('numpy')[:]

# 3. Directly access the columns, which are now the NumPy arrays you want.
#    The exercise only needs the training split.
X = train_data_np['img']
y = train_data_np['label']

### Part 1: Image Analysis with CIFAR-10

Here, we will analyze the color composition of images from the CIFAR-10 dataset, which contains 10 classes of small (32x32 pixel) images (airplanes, cars, birds, etc.). For each class and color channel, produce plots that describe the distribution of pixel intensities.

#### Guiding Steps:

##### 1 - Load the Data (guided):

The easiest way to get this data as NumPy arrays is to use a high-level library. We'll use ```datasets``` from Hugging Face (a very powerful platform for machine learning). This gives us training images, training labels, testing images, and testing labels. We only need the training set for this exercise.
Check the shapes and types of the images and labels. This part is a bit tricky, so use the block of code given above!
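The shape and type check can be sketched as follows. This is a minimal sketch, not a required solution: `summarize` is a hypothetical helper name, and the random stand-in arrays only mimic the CIFAR-10 layout (N, 32, 32, 3) so the snippet runs without downloading anything. In the notebook, pass the real `X` and `y` instead.

```python
import numpy as np

def summarize(name, arr):
    """Return a one-line description of an array: shape, dtype and value range."""
    return (f"{name}: shape={arr.shape}, dtype={arr.dtype}, "
            f"min={arr.min()}, max={arr.max()}")

# Stand-in data (assumption: images are uint8 with channels last, labels are 0..9).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=100)

print(summarize("images", X_demo))
print(summarize("labels", y_demo))
```

If the real arrays report a different dtype or value range (for example floats in 0 to 1), the histogram range used later has to be adapted accordingly.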

##### 2 - Organize the Data:

Create a list of the 10 class names in the correct order: ```['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']```.
For each of the 10 classes, you need to isolate all the images belonging to that class.

(Hint: Use boolean indexing with NumPy to find the indices for each class, then use those indices to select the images.)
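The boolean-indexing hint could look like the sketch below. It assumes the labels are integers 0 to 9 in the same order as the class-name list, and it uses random stand-in arrays (`X_demo`, `y_demo`, both hypothetical names) so it runs on its own.

```python
import numpy as np

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Stand-in data with the CIFAR-10 layout; replace with the real X and y.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 32, 32, 3), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=200)

images_by_class = {}
for class_id, name in enumerate(class_names):
    mask = (y_demo == class_id)            # boolean array, True where the label matches
    images_by_class[name] = X_demo[mask]   # the mask selects along the first (image) axis

for name, imgs in images_by_class.items():
    print(f"{name:>10}: {imgs.shape[0]} images")
```

Storing the result in a dictionary keyed by class name keeps the later histogram and plotting loops short, but a list indexed by class id would work equally well.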

##### 3 - Calculate Color Channel Distributions:

For each class of images:

 - Separate the Red, Green, and Blue channels. The last dimension of the image array corresponds to these channels (0=R, 1=G, 2=B).
 - For each channel (R, G, B), flatten all the pixel values for that channel across all images in the class into a single 1D array.

Use ```numpy.histogram``` to calculate the distribution of pixel intensities (from 0 to 255) for each channel. Use 256 bins for a full distribution.
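One possible way to compute the three channel histograms for a single class is sketched below. `channel_histograms` is a hypothetical helper name, and the input is assumed to be a uint8 array of shape (N, H, W, 3).

```python
import numpy as np

def channel_histograms(images):
    """Return an array of shape (3, 256): pixel-intensity counts for R, G and B."""
    hists = []
    for channel in range(3):
        values = images[..., channel].ravel()   # all pixels of this channel as a 1D array
        # 256 bins over [0, 256) give exactly one bin per integer intensity.
        counts, _ = np.histogram(values, bins=256, range=(0, 256))
        hists.append(counts)
    return np.stack(hists)

# Stand-in images; in the exercise, pass the images of one class.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(50, 32, 32, 3), dtype=np.uint8)
hists = channel_histograms(images)
print(hists.shape)
```

Passing `range=(0, 256)` explicitly matters: without it, `numpy.histogram` spans only the minimum and maximum values present, so the bins of different classes would not line up.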

##### 4 - Visualize the Results:

Create a grid of subplots (one for each of the 10 image classes). In each subplot, plot the distributions for the Red, Green, and Blue channels as three separate lines on the same axes. Label the axes and use appropriate plotting parameters so that the final grid of plots displays cleanly.
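A possible layout for the grid is sketched here, assuming the histograms are stored in an array of shape (10, 3, 256). The array is filled with random stand-in values, and the 2x5 arrangement, figure size and labels are only one reasonable choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Stand-in histograms: (classes, channels, bins). Replace with the real counts.
rng = np.random.default_rng(0)
hists = rng.integers(0, 1000, size=(10, 3, 256))

fig, axes = plt.subplots(2, 5, figsize=(20, 7), sharex=True, sharey=True)
intensities = np.arange(256)

for ax, name, class_hist in zip(axes.ravel(), class_names, hists):
    for channel, color in enumerate(("red", "green", "blue")):
        ax.plot(intensities, class_hist[channel], color=color,
                label=color.capitalize(), linewidth=1)
    ax.set_title(name)
    ax.set_xlim(0, 255)

for ax in axes[-1, :]:
    ax.set_xlabel("Pixel intensity")   # label only the bottom row
for ax in axes[:, 0]:
    ax.set_ylabel("Pixel count")       # label only the left column

axes[0, 0].legend()
fig.suptitle("Pixel intensity distribution per class and channel")
fig.tight_layout()
```

Sharing the axes makes the classes directly comparable. In a notebook, finish with `plt.show()`.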

### Part 2: Text Analysis with 20 Newsgroups

In this part, we will load text from several newsgroup categories and determine the most characteristic words for each, filtering out common "stop words."

#### Guiding Steps:

##### 1 - Load the Data:

Use the loader function ```fetch_20newsgroups``` from ```sklearn.datasets```.
The loaded object contains the text data (```.data```) and the numerical labels (```.target```). The category names are in ```.target_names```.
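Loading and grouping the documents might be sketched as follows. `load_newsgroups` and `group_documents` are hypothetical helper names. The `subset`, `categories` and `remove` arguments are one possible choice and not something the exercise prescribes. The demo at the bottom uses a tiny hand-made stand-in so nothing is downloaded.

```python
from collections import defaultdict

def load_newsgroups(categories=None):
    """Fetch the training subset of 20 Newsgroups (downloads it on first use)."""
    from sklearn.datasets import fetch_20newsgroups  # lazy import: needs scikit-learn
    # Assumption: stripping headers, footers and quotes leaves mostly message text.
    return fetch_20newsgroups(subset="train", categories=categories,
                              remove=("headers", "footers", "quotes"))

def group_documents(data, target, target_names):
    """Map each category name to the list of documents labelled with it."""
    grouped = defaultdict(list)
    for text, label in zip(data, target):
        grouped[target_names[label]].append(text)
    return dict(grouped)

# Tiny stand-in with the same three fields as the loaded object.
demo_data = ["The rocket reached orbit", "New engine for my car", "Shuttle launch delayed"]
demo_target = [1, 0, 1]
demo_names = ["rec.autos", "sci.space"]

docs_by_category = group_documents(demo_data, demo_target, demo_names)
print({name: len(docs) for name, docs in docs_by_category.items()})

# With the real data (triggers a download):
# newsgroups = load_newsgroups(["rec.autos", "sci.space"])
# docs_by_category = group_documents(newsgroups.data, newsgroups.target,
#                                    newsgroups.target_names)
```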

##### 2 - Process and Clean the Text:

Create a simple list of common English stop words ("trivial words") to ignore (e.g., 'have', 'are', 'they', ...).
Create a dictionary to hold the word counts for each category.
Loop through each unique newsgroup category. For each category:

- Gather all the text documents belonging to that category (suggestion: convert everything to lowercase).
- Filter the words: keep only words that are purely alphabetic, longer than 3 characters, and not in your stop word list.
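The cleaning step above can be sketched like this. The stop word set is a short illustrative sample, not a complete list, and `filter_words` is a hypothetical helper name. Splitting on whitespace is the simplest tokenization: tokens with attached punctuation (such as "rocket,") fail the `isalpha()` test and are dropped.

```python
STOP_WORDS = {"have", "are", "they", "that", "this", "with", "from",
              "were", "been", "will", "would", "there", "their", "about"}

def filter_words(documents, stop_words=STOP_WORDS):
    """Lowercase the documents and keep alphabetic words longer than 3 characters."""
    words = []
    for doc in documents:
        for token in doc.lower().split():
            if token.isalpha() and len(token) > 3 and token not in stop_words:
                words.append(token)
    return words

docs = ["They have launched the rocket, and the Rocket works!",
        "Orbit data from the probe"]
print(filter_words(docs))
```

Using a set rather than a list for the stop words keeps each membership test fast, which matters once thousands of documents are processed.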


##### 3 - Calculate Word Frequencies:

- For each category, use your filtered list of words to calculate the frequency of each word.
  (Hint: The ```collections.Counter``` object is perfect for counting the frequency of items in a list.)
- Find the 10 most common words for each category.
- Combine all words from all categories in a single group.
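A minimal sketch of the counting steps, using a small hand-made dictionary of filtered words as a stand-in. It interprets "combine all words" as merging the per-category counters into one overall counter, which is an assumption about what the step intends.

```python
from collections import Counter

# Stand-in for the filtered word lists of each category.
words_by_category = {
    "sci.space": ["orbit", "rocket", "orbit", "launch", "orbit", "rocket"],
    "rec.autos": ["engine", "wheel", "engine", "rocket"],
}

# One Counter per category: word -> number of occurrences.
counts_by_category = {cat: Counter(words) for cat, words in words_by_category.items()}

# The 10 most common (word, count) pairs per category (fewer if the vocabulary is smaller).
top_words = {cat: counts.most_common(10) for cat, counts in counts_by_category.items()}

# Merge all categories into a single overall counter.
all_counts = Counter()
for counts in counts_by_category.values():
    all_counts.update(counts)

print(top_words["sci.space"])
print(all_counts.most_common(3))
```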

##### 4 - Visualize the Results:

Using ```matplotlib.pyplot```, create a grid of subplots (one for each category).
In each subplot, create a chart showing the counts of the most frequent words for that category.
Label the axes and use appropriate plotting parameters so that the final grid of plots displays cleanly.
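One way to lay out the word-count charts is sketched below. The words and counts are made-up stand-in values for illustration only, and horizontal bars in a two-column grid are just one readable choice.

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Made-up stand-in for the (word, count) pairs of each category.
top_words = {
    "sci.space": [("orbit", 30), ("rocket", 22), ("launch", 15)],
    "rec.autos": [("engine", 28), ("wheel", 12), ("brake", 9)],
    "comp.graphics": [("image", 40), ("pixel", 18), ("color", 11)],
}

n_cols = 2
n_rows = math.ceil(len(top_words) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows),
                         squeeze=False)
flat_axes = axes.ravel()

for ax, (category, pairs) in zip(flat_axes, top_words.items()):
    words, counts = zip(*pairs)
    ax.barh(words, counts)
    ax.invert_yaxis()          # most frequent word at the top
    ax.set_title(category)
    ax.set_xlabel("Count")
    ax.set_ylabel("Word")

# Hide any unused subplot when the grid has more cells than categories.
for ax in flat_axes[len(top_words):]:
    ax.set_visible(False)

fig.tight_layout()
```

Horizontal bars keep long words legible without rotating tick labels. In a notebook, finish with `plt.show()`.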