# Clean data

In this notebook, we clean the data and convert it into AlkgieV1 entities.

This step generates ids and shapes the data into a coherent set of domain-specific entities that the Alkgie project can use.

## Flatten the data

In [1]:
#r "nuget:FSharp.Data"

open FSharp.Data
type ScrappedData = JsonProvider<"../data/scrapped/scrapped-dataset.json">
let datasets = ScrappedData.Load("../data/scrapped/scrapped-dataset.json")

type FlattenedData = {
    Link: Option<string>
    Name: Option<string>
    Description: string
    Headers: string[]
    AwesomeList: string
}

let flattenedData = 
    datasets
    |> Seq.collect (fun dataset ->
        dataset.Data
        |> Seq.map (fun item -> 
            {
                Link = item.Link
                Name = item.Name
                Description = item.Description
                Headers = item.Headers
                AwesomeList = dataset.Filename
            }
        )
    )
    |> Seq.toList
flattenedData


## Transform flattened data into AlkgieV1 entities

"AlkgieV1 entities" is the name I'm giving to the data format (schema) that this project produces. It's not a final format, hence the V1.

Borrowing from thematic analysis, individual entities are called "codes", in the sense of a code identified during thematic analysis.

Themes are groupings of codes that form a cohesive whole.

In most instances I expect codes to be specific software products such as programming languages (like F# or JavaScript), whereas I expect themes to be concepts (such as the concept of programming languages itself).
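
To make the distinction concrete, here is a minimal sketch of how an item might be labelled as a theme or a code. The `SampleItem` type, the sample rows, and the `classify` helper are hypothetical and exist only for illustration; the rule it applies (a name that also appears as a header in the same list counts as a theme) is an assumption that mirrors the transformation cell further down.

```fsharp
// Hypothetical, simplified item shape; not the real flattened record.
type SampleItem =
    { Name: string option
      Headers: string[]
      AwesomeList: string }

// Made-up rows, not taken from the real dataset.
let sample =
    [ { Name = Some "Programming Languages"; Headers = [||]; AwesomeList = "awesome-example" }
      { Name = Some "F#"; Headers = [| "Programming Languages" |]; AwesomeList = "awesome-example" } ]

// Every (header, list) pair that appears on any item is treated as a theme.
let themeKeys =
    sample
    |> List.collect (fun item ->
        item.Headers |> Array.toList |> List.map (fun header -> header, item.AwesomeList))
    |> Set.ofList

// An item whose name matches a known header in the same list is a theme; anything else is a code.
let classify (item: SampleItem) =
    match item.Name with
    | Some name when Set.contains (name, item.AwesomeList) themeKeys -> "Theme"
    | _ -> "Code"

let labels = sample |> List.map classify
```

In this sketch the "Programming Languages" row is labelled a theme because another row lists it as a header, while "F#" falls through to a code.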

We also assign a unique Id at this stage to make the data easier to graph. These Ids are not stable between versions of the dataset; hopefully a future version of this project will find a way to keep them stable.
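
One possible direction for stable Ids, offered only as a sketch and not as something this notebook does: derive the Guid from a hash of the source list and the entity name, so the same pair always yields the same Id. The `stableId` helper is hypothetical. MD5 is used here purely because its 16-byte digest fits a `Guid`; the result is not an RFC 4122 name-based UUID.

```fsharp
open System
open System.Security.Cryptography
open System.Text

// Hypothetical helper: the same (source, name) pair always maps to the same Guid.
let stableId (source: string) (name: string) =
    // Join with a newline so ("a", "bc") and ("ab", "c") produce different keys.
    let key = source + "\n" + name
    use md5 = MD5.Create()
    let hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key))
    Guid(hash)

let first = stableId "awesome-example" "F#"
let second = stableId "awesome-example" "F#"
let other = stableId "awesome-example" "JavaScript"
```

Here `first` and `second` are equal, while `other` differs. Items without a name would need some other key (for example their link), which this sketch does not address.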

In effect, this is the Relationship Classification (RC) stage of this data analysis project.

In [2]:
// Guid lives in the System namespace
open System

type AlkgieV1Relation = {
    Id: Guid
    Source: string
}

type AlkgieV1EntityTypes =
    | Theme
    | Code

type AlkgieV1Entity = {
    Id: Guid
    Relations: AlkgieV1Relation[]
    Link: Option<string>
    Name: Option<string>
    Description: string
    Source: string
    EntityType: AlkgieV1EntityTypes
}

// Collect headers as Themes
let themePartials =
    flattenedData
    |> Seq.collect (fun item -> 
        // For each item, create a sequence of tuples (header, source)
        item.Headers |> Seq.map (fun header -> (header, item.AwesomeList))
    )
    |> Seq.distinct
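    |> Seq.map (fun (header, source) -> 
        {|
            Id = Guid.NewGuid()
            Source = source
            Name = header
        |}
    )
    // Materialize once: a lazy seq would call Guid.NewGuid() again on every enumeration,
    // so the helpers below would see different Ids for the same theme.
    |> Seq.toList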

// Helper functions
let getEntityType source name =
    match name with
    | None -> Code
    | Some actualName ->
        if themePartials |> Seq.exists (fun theme -> theme.Source = source && theme.Name = actualName) then
            Theme
        else
            Code

let getRelations source headers =
    themePartials
    |> Seq.filter (fun theme -> theme.Source = source && headers |> Seq.exists (fun header -> header = theme.Name))
    |> Seq.map (fun theme -> { Id = theme.Id; Source = theme.Source })

let getId source name =
    match name with
    | None -> Guid.NewGuid()
    | Some actualName ->
        match themePartials |> Seq.tryFind (fun theme -> theme.Source = source && theme.Name = actualName) with
        | Some(theme) -> theme.Id
        | None -> Guid.NewGuid()

// Generate entities
let entities =
    flattenedData
    |> Seq.map (fun item -> 
        {
            Id = getId item.AwesomeList item.Name
            Relations = getRelations item.AwesomeList item.Headers |> Seq.toArray
            Link = item.Link
            Name = item.Name
            Description = item.Description
            Source = item.AwesomeList
            EntityType = getEntityType item.AwesomeList item.Name
        }
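    )
    // Materialize once so the Ids generated here stay the same each time `entities` is read.
    |> Seq.toList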

entities

## Save cleaned data

Write the cleaned entities to disk as indented JSON.

In [6]:

// This is a temporary hack to get around not being able to serialize Discriminated Unions
let entityTypeToString entityType =
    match entityType with
    | Theme -> "Theme"
    | Code -> "Code"

let temp =
    entities
    |> Seq.map (fun entity -> {| entity with EntityType = entityTypeToString entity.EntityType |})


// Actual saving
open System.IO
open System.Text.Json

let filePath = "../data/cleaned/cleaned-dataset.json"
let json = JsonSerializer.Serialize(temp, JsonSerializerOptions(WriteIndented = true))

let directoryPath = Path.GetDirectoryName(filePath)
if not <| Directory.Exists(directoryPath) then
    Directory.CreateDirectory(directoryPath) |> ignore

File.WriteAllText(filePath, json)
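
The string conversion above sidesteps Discriminated Union serialization. A possible alternative, sketched here under assumptions and not tested against this dataset, is a custom `System.Text.Json` converter that writes each case as its name. `EntityKind` is a stand-in for `AlkgieV1EntityTypes`, and `EntityKindConverter` is a hypothetical name.

```fsharp
open System
open System.Text.Json
open System.Text.Json.Serialization

// Stand-in for AlkgieV1EntityTypes so the sketch is self-contained.
type EntityKind =
    | Theme
    | Code

// Hypothetical converter: reads and writes the union case as a plain JSON string.
type EntityKindConverter() =
    inherit JsonConverter<EntityKind>()

    override _.Read(reader: byref<Utf8JsonReader>, _typeToConvert: Type, _options: JsonSerializerOptions) : EntityKind =
        match reader.GetString() with
        | "Theme" -> Theme
        | "Code" -> Code
        | other -> raise (JsonException(sprintf "Unknown entity kind: %s" other))

    override _.Write(writer: Utf8JsonWriter, value: EntityKind, _options: JsonSerializerOptions) =
        let text: string =
            match value with
            | Theme -> "Theme"
            | Code -> "Code"
        writer.WriteStringValue(text)

// Register the converter on the options used for both directions.
let kindOptions = JsonSerializerOptions(WriteIndented = true)
kindOptions.Converters.Add(EntityKindConverter())

let kindJson = JsonSerializer.Serialize<EntityKind>(Theme, kindOptions)
let roundTripped = JsonSerializer.Deserialize<EntityKind>(kindJson, kindOptions)
```

With a converter like this registered, the intermediate anonymous-record copy might no longer be needed, though that would have to be checked against the real entity records.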